Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

Alejandra Beghelli; Laura Toni; Michael Doherty

arxiv: 2605.02075 · v2 · pith:6GAPTUKInew · submitted 2026-05-03 · 💻 cs.NI


Pith reviewed 2026-05-20 23:56 UTC · model grok-4.3

classification 💻 cs.NI

keywords graph transformers · reinforcement learning · dynamic RMSA · elastic optical networks · routing modulation spectrum allocation · spectrum allocation · network optimization · large-scale networks


The pith

A graph transformer trained via stabilized reinforcement learning supports up to 13 percent more traffic load than prior RL methods and heuristics on dynamic RMSA benchmarks for elastic optical networks, and scales to real topologies of up to 143 nodes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a transformer can be trained with reinforcement learning to solve dynamic routing, modulation and spectrum assignment on optical networks. Earlier RL approaches had not succeeded with transformers because of data demands and training instability. The authors combine rotary positional encodings for graphs, off-policy masking of invalid actions, valid mass regularization, and GPU-accelerated simulation to produce stable training. The resulting agent exceeds every previous RL and heuristic benchmark and scales to networks with 143 nodes and 362 links.

Core claim

By integrating rotary positional encodings for graph-structured data, off-policy invalid action masking, and valid mass regularization together with GPU-accelerated simulation, stable RL training of a transformer becomes possible for dynamic RMSA. This yields the first RL method that surpasses all benchmarks, increasing supportable traffic load by up to 13 percent and by up to 4 percent at blocking probabilities below 0.1 percent on real topologies up to 143 nodes.
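A minimal sketch of how invalid action masking and a valid-mass penalty could fit together, in plain Python. This page does not give the paper's exact loss, so the penalty form (the negative log of the probability mass that the unmasked policy places on valid actions) and every function name below are assumptions for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats (-inf entries get probability 0)."""
    m = max(logits)
    if m == -math.inf:
        raise ValueError("at least one action must be valid")
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(logits, valid):
    """Invalid action masking: invalid actions receive a logit of -inf before the softmax."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid)]
    return softmax(masked)

def valid_mass(logits, valid):
    """Probability mass the UNMASKED policy assigns to valid actions."""
    probs = softmax(logits)
    return sum(p for p, ok in zip(probs, valid) if ok)

def valid_mass_penalty(logits, valid):
    """Hypothetical regularizer: zero when all mass is on valid actions, growing as mass leaks."""
    return -math.log(valid_mass(logits, valid))

logits = [2.0, 1.0, 0.0, -1.0]
valid = [True, False, True, False]
behaviour_probs = masked_softmax(logits, valid)  # distribution actions would be sampled from
penalty = valid_mass_penalty(logits, valid)      # extra loss term on the raw logits
```

In this sketch the masked distribution is the one used for sampling, while the penalty is computed on the raw logits, so the network is pushed to place its own probability mass on valid actions rather than relying on the mask alone.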

What carries the argument

Graph transformer equipped with rotary positional encodings, trained under off-policy invalid action masking and valid mass regularization inside a GPU-accelerated simulator for dynamic RMSA decisions.
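The property that makes rotary encodings attractive can be shown in a few lines of plain Python: after rotating query and key vectors by position-dependent angles, their dot product depends only on the difference between the two positions. The scalar `position` and the `freqs` values below are placeholders chosen for illustration. How the paper derives positions for graph tokens is not specified in this summary, so this is a generic rotary sketch rather than the authors' encoding.

```python
import math

def rotate_pairs(vec, position, freqs):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by the angle position * freqs[i]."""
    out = []
    for i, f in enumerate(freqs):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(position * f), math.sin(position * f)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, -0.5, 0.3]    # illustrative query vector
k = [0.7, -1.1, 0.4, 2.0]    # illustrative key vector
freqs = [1.0, 0.1]           # one rotation frequency per pair (assumed values)

# Same positional offset (2.0) at two different absolute positions.
score_a = dot(rotate_pairs(q, 3.0, freqs), rotate_pairs(k, 1.0, freqs))
score_b = dot(rotate_pairs(q, 5.0, freqs), rotate_pairs(k, 3.0, freqs))
```

Because each rotation is orthogonal, vector norms are unchanged and `score_a` equals `score_b` up to floating-point error, so the attention score carries relative rather than absolute position.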

If this is right

Higher traffic loads can be carried on existing elastic optical networks before blocking becomes unacceptable.
The approach scales to the largest dynamic RMSA instances yet tackled by RL, including real topologies with up to 143 nodes and 362 links.
Ablation results identify which training components most affect allocation quality and loss stability.
Open code release enables direct reproduction and extension on new network instances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stabilization recipe may transfer to other graph-structured resource allocation tasks that previously resisted transformer RL.
If blocking remains low at the reported loads, operators could defer costly capacity upgrades in spectrum-constrained links.
Further scaling tests on time-varying traffic patterns would show whether the learned policy remains robust beyond the evaluated static request models.

Load-bearing premise

The listed combination of rotary encodings, action masking, regularization, and fast simulation is what produces stable transformer training and superior RMSA performance.

What would settle it

Evaluating the trained agent on the same 143-node topologies and finding that its supported traffic load at low blocking probability does not exceed the best prior benchmark would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2605.02075 by Alejandra Beghelli, Laura Toni, Michael Doherty.

**Figure 1.** Overview of the XLRON training architecture and algorithm. Parallel environments (green) on GPU generate experience from actions selected by the agent (blue), which comprises a Graph Transformer agent (light blue) trained with stabilized PPO. Key components of the learning algorithm (purple) include off-policy invalid action masking, valid mass stabilization, and WiRE positional encodings (red).

**Figure 2.** Wavelet-Induced Rotary Encoding (WiRE) for injecting graph positional information into the transformer. The network topology is converted to a line graph, spectral features are extracted from the Laplacian eigenvectors, and rotary position encodings are applied to the query and key vectors in each attention head.

**Figure 3.** Actor-critic Graph Transformer architecture. The actor uses path pooling (concatenation of min, mean, max over path tokens) to rank candidate lightpaths, while the critic uses learned attention pooling over all tokens to estimate state value.

**Figure 4.** Service blocking probability as a function of traffic load for our method (Transformer RL) compared to previous RL methods (Deep/Reward/GCN-RMSA, MaskRSA, PtrNet-RSA), the best heuristic in each, and upper bound network capacity estimates (defragmentation bound and cut-set bound) across NSFNET, COST239, USNET, and JPN48 topologies. Shaded regions indicate the standard error of the mean across parallel env…

**Figure 5.** The (a) TataInd and (b) USA100 network topologies used in the large-scale experiments. Node labels indicate node IDs and edge labels indicate link lengths in km.

**Figure 6.** Ablation study of key training components on TataInd and USA100 topologies. Each curve removes one component from the full model (All Features). The FF-KSP heuristic benchmark is shown for reference. Shaded regions indicate the upper and lower interquartile range across parallel environments. Removing the valid mass loss entirely (“No VML”) has differing effects in each case: on USA100 it causes a late tr…

**Figure 7.** Decomposition of the scalar magnitude of the total training loss into its constituent components (actor, valid mass, value, and entropy losses) over the course of training for TataInd and USA100.

**Figure 8.** Service blocking probability as a function of traffic load for the Transformer agent and FF-KSP heuristic on TataInd and USA100. Shaded regions indicate the standard error of the mean across parallel environments.

**Figure 10.** Mean path length in km (top) and hops (bottom) over the course of a single evaluation episode for the Transformer agent and FF-KSP heuristic on TataInd and USA100.

**Figure 11.** Difference in assigned path length (Transformer minus FF-KSP) in km (top) and hops (bottom) per traffic request for TataInd and USA100. Shaded regions indicate where one method selects shorter paths.

**Figure 13.** Difference in frequency slot unit (FSU) occupancy between FF-KSP and Transformer across all links for TataInd and USA100. Green indicates higher occupancy by FF-KSP; purple indicates higher occupancy by the Transformer.

**Figure 14.** Difference in link usage between Transformer and FF-KSP for each link in TataInd and USA100. Positive values (green) indicate links that have more requests that use them allocated by the Transformer; negative values (purple) indicate links used less.
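The path pooling named in the Figure 3 caption (concatenation of min, mean, and max over path tokens) can be sketched as follows. The assumption that there is one token per link, and the names `tokens` and `path_links`, are illustrative only; the paper's tensor layout is not described on this page.

```python
def path_pool(tokens, path_links):
    """Concatenate element-wise min, mean, and max over the tokens of the links on a path.

    tokens: list of equal-length feature vectors, assumed here to be one per link.
    path_links: indices of the links traversed by one candidate lightpath.
    """
    selected = [tokens[i] for i in path_links]
    dim = len(selected[0])
    mins = [min(t[d] for t in selected) for d in range(dim)]
    means = [sum(t[d] for t in selected) / len(selected) for d in range(dim)]
    maxs = [max(t[d] for t in selected) for d in range(dim)]
    return mins + means + maxs

# Three link tokens with two features each; the candidate path uses links 0 and 2.
tokens = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
pooled = path_pool(tokens, [0, 2])  # [1.0, 2.0, 3.0, 3.0, 5.0, 4.0]
```

The pooled vector has a fixed length of three times the token dimension regardless of path length, which is what allows candidate paths of different hop counts to be scored by the same readout layer.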

Original abstract

Reinforcement learning (RL) has been widely applied to dynamic routing, modulation and spectrum assignment (RMSA) in optical networks, yet no prior work has trained a transformer model for this task. We attribute this to the high data and compute requirements of transformers and potential training instabilities with RL. We address this gap by combining recent advances from the machine learning literature (rotary positional encodings for graph-structured data, off-policy invalid action masking, and valid mass regularization) with GPU-accelerated simulation to achieve, for the first time, stable RL training of a transformer for dynamic RMSA. We demonstrate, through systematic benchmarking against previous RL methods and heuristic algorithms, that ours is the first RL method to exceed all benchmarks, increasing the supportable traffic load by up to 13%. To demonstrate the scalability of our approach, we train on real network topologies from the TopologyBench database up to 143 nodes and 362 links, with 320 x 12.5 GHz frequency slot units per link, and 100 Gbps traffic requests. To our knowledge, these are the largest dynamic RMSA problems to which RL has been applied. We find up to 4% increased traffic load can be supported at low blocking probability (<0.1%) with our method compared to the best available benchmark algorithm. We present an ablation study of the components of our training algorithm, the dynamics of the loss function during training, and analyze the allocation decisions of the trained models. We make all code used to produce this paper openly available for reproduction and future benchmarking: https://github.com/micdoh/XLRON.
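To make the problem setting concrete, the sketch below works out the per-link spectrum implied by the abstract (320 slots of 12.5 GHz) and shows a generic first-fit spectrum assignment over an ordered list of candidate paths. It is a simplified illustration of the kind of heuristic an RL agent is compared against, not the paper's FF-KSP benchmark or its simulator. The `width` argument (slots needed per request) is a hypothetical parameter, since the slot count for a 100 Gbps request depends on the modulation format chosen.

```python
SLOT_WIDTH_GHZ = 12.5
SLOTS_PER_LINK = 320
total_spectrum_ghz = SLOTS_PER_LINK * SLOT_WIDTH_GHZ  # 4000.0 GHz, i.e. 4 THz per link

def first_fit(path_links, occupancy, width):
    """Lowest start index whose `width` contiguous slots are free on every link of the path."""
    n_slots = len(occupancy[path_links[0]])
    for start in range(n_slots - width + 1):
        if all(not occupancy[link][j]
               for link in path_links
               for j in range(start, start + width)):
            return start
    return None

def allocate_on_first_feasible_path(candidate_paths, occupancy, width):
    """Try candidate paths in order; allocate on the first one with a feasible slot block."""
    for path in candidate_paths:
        start = first_fit(path, occupancy, width)
        if start is not None:
            for link in path:
                for j in range(start, start + width):
                    occupancy[link][j] = True
            return path, start
    return None  # request is blocked

# Toy network state: two links with eight slots each (True = occupied).
occupancy = {
    0: [True, True, False, False, False, False, False, False],
    1: [False, False, False, True, False, False, False, False],
}
first_result = allocate_on_first_feasible_path([[0, 1]], occupancy, width=2)
second_result = allocate_on_first_feasible_path([[0, 1]], occupancy, width=3)
```

The first request lands on slots 4 and 5 of both links; the second finds no block of three free slots common to both links and is blocked. Blocking probability is then the fraction of requests that return `None` over an episode.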

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gets a graph transformer stably trained with RL for dynamic RMSA at 143-node scale for the first time and reports gains over prior methods, but the 13% traffic improvement needs close checks on baseline tuning.


The main point is that the authors have trained a transformer with reinforcement learning for dynamic routing, modulation, and spectrum assignment in elastic optical networks, something no previous work achieved, which they attribute to data demands and training instability. They combine rotary positional encodings suited to graphs, off-policy invalid action masking, valid mass regularization, and GPU-accelerated simulation to stabilize training. This lets them handle real topologies of up to 143 nodes and 362 links, with 320 frequency slot units per link and 100 Gbps requests, which they describe as the largest scale at which RL has been applied to this problem. They run systematic benchmarks against earlier RL approaches and heuristics, add an ablation study of the stabilization components, and release all code on GitHub. That open resource and the scale demonstration are the clearest strengths here.

The reported outcome is that the method supports up to 13% more traffic load than previous RL methods and heuristics, and up to 4% more than the best available benchmark at low blocking probability (<0.1%) on the large topologies. The stress-test concern about unequal optimization effort on the baselines is worth watching. RL performance is sensitive to hyperparameter choices and training details, so if the older methods did not receive equivalent tuning or compute, part of the gap could trace to that rather than to the new architecture alone. The abstract mentions the ablation and loss dynamics, which should help, but the full results section will need to show the exact controls and statistical spread.

This work sits at the intersection of optical networking and applied RL. Readers working on practical network optimization or on scaling RL to graph-structured control problems will find the implementation details and scalability numbers useful. It is coherent on its own terms and shows honest engagement with the literature through the comparisons and ablation. I would send it to peer review so referees can examine the baseline setups and statistical reporting in detail.

Referee Report

2 major / 1 minor

Summary. The paper introduces a graph transformer architecture trained via stabilized reinforcement learning for dynamic routing, modulation, and spectrum allocation (RMSA) in elastic optical networks. It combines rotary positional encodings for graphs, off-policy invalid action masking, and valid mass regularization with GPU-accelerated simulation to enable stable training, which prior work had not achieved for transformers on this task. Through systematic benchmarking in simulation on real network topologies (up to 143 nodes), the authors report that their method is the first RL approach to exceed all prior RL and heuristic baselines, supporting up to 13% higher traffic load, and up to 4% higher load than the best available benchmark at low blocking probability (<0.1%) on the largest topologies. An ablation study, loss dynamics, and allocation analysis are included, along with open code.

Significance. If the central performance claims are confirmed under equitable baseline tuning, the work would mark a meaningful step in scaling modern sequence models to large dynamic network control problems. The explicit provision of open code, the scale of the evaluated topologies, and the inclusion of an ablation study are concrete strengths that support reproducibility and further progress in the area.

major comments (2)

[benchmarking section] Abstract and benchmarking section: the claim that the method is 'the first RL method to exceed all benchmarks, increasing the supportable traffic load by up to 13%' is load-bearing for the paper's contribution. The manuscript must demonstrate that the prior RL baselines received hyperparameter tuning and training resources comparable to those used for the proposed transformer (which benefits from the new stabilization techniques); without such details, the reported gap cannot be unambiguously attributed to the architectural and algorithmic advances rather than unequal optimization effort.
[ablation study section] Ablation study section: while the components (rotary encodings, invalid action masking, valid mass regularization) are listed as enabling stable training, the quantitative impact of each on training stability metrics (e.g., variance of returns or convergence speed) and on the final 13% gain should be reported with error bars across multiple random seeds to substantiate that the combination is sufficient for the claimed stability.

minor comments (1)

[experimental setup] Clarify in the methods whether the 320 frequency slots per link and 100 Gbps request sizes are fixed across all experiments or varied; this affects interpretation of the scalability results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of fair benchmarking and rigorous ablation analysis. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.


Referee: Abstract and benchmarking section: the claim that the method is 'the first RL method to exceed all benchmarks, increasing the supportable traffic load by up to 13%' is load-bearing for the paper's contribution. The manuscript must demonstrate that the prior RL baselines received hyperparameter tuning and training resources comparable to those used for the proposed transformer (which benefits from the new stabilization techniques); without such details, the reported gap cannot be unambiguously attributed to the architectural and algorithmic advances rather than unequal optimization effort.

Authors: We agree that transparent documentation of baseline tuning is essential to attribute gains to the proposed graph transformer and stabilization methods. All baselines were re-implemented within the same GPU-accelerated simulator used for our approach, and we performed hyperparameter searches (grid and random) over learning rates, network sizes, and exploration parameters using equivalent total training steps and compute budgets. In the revised manuscript we will add an explicit subsection under benchmarking that tabulates the search ranges, final hyperparameters, and wall-clock training times for each baseline. The open-source repository already contains the exact configuration files and reproduction scripts, enabling independent verification that the reported 13% improvement is not an artifact of unequal optimization effort.
Revision: yes
Referee: Ablation study section: while the components (rotary encodings, invalid action masking, valid mass regularization) are listed as enabling stable training, the quantitative impact of each on training stability metrics (e.g., variance of returns or convergence speed) and on the final 13% gain should be reported with error bars across multiple random seeds to substantiate that the combination is sufficient for the claimed stability.

Authors: We accept that the current ablation study, while demonstrating necessity through failure modes when components are ablated, would be strengthened by quantitative stability metrics and statistical reporting. In the revision we will extend the ablation section with new runs across five random seeds, reporting mean and standard deviation (error bars) for return variance, convergence epoch, and final blocking probability. We will also include a table quantifying the contribution of each component to the overall performance gain relative to the full model. These additions will directly substantiate that the combination of rotary encodings, invalid-action masking, and valid-mass regularization is required for stable transformer training on this task.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; results from empirical RL training and external benchmarking


The paper's central claims rest on training a graph transformer RL agent for dynamic RMSA using GPU simulation, then measuring blocking probability and supportable traffic load on simulated topologies from TopologyBench. Performance gains (up to 13% vs. prior RL and heuristics) are reported directly from these evaluations and ablations rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. The stabilization techniques are drawn from external ML literature; no equations or uniqueness theorems reduce the reported improvements to quantities defined by the target metric itself. The open code further supports independent verification against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL modeling of network states as MDPs, the effectiveness of cited stabilization techniques from external ML literature, and the fidelity of the GPU simulator to real optical network behavior.

axioms (1)

Domain assumption: Network dynamics for routing, modulation, and spectrum assignment can be accurately modeled as a Markov decision process for reinforcement learning.
Invoked implicitly when applying RL to the RMSA task.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: “We address this gap by combining recent advances from the machine learning literature (rotary positional encodings for graph-structured data, off-policy invalid action masking, and valid mass regularization) with GPU-accelerated simulation to achieve, for the first time, stable RL training of a transformer for dynamic RMSA.”

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: “We adopt Pre-LayerNorm and introduce off-policy invalid action masking (Section C) and valid mass stabilization (Section D) to prevent collapse while enabling effective feature learning.”

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 7 internal anchors

[1] B. Jaumard, A. Mohammed, and Q. A. Nguyen, “Decomposition Models for the Routing and Slot Provisioning Problem,” in 2023 International Conference on Computing, Networking and Communications (ICNC), (2023), pp. 659–665.
[2] M. Doherty, R. Matzner, R. Sadeghi, P. Bayvel, and A. Beghelli, “Reinforcement learning for dynamic resource allocation in optical networks: hype or hope?” J. Opt. Commun. Netw. 17, D1 (2025).
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” (2017). Version Number: 7.
[4] S. Chen, J. Wang, and M. Shigeno, “Transformer-pointer DRL model for static resource allocation problems in SDM-EONs,” J. Opt. Commun. Netw. 18, 315 (2026).
[5] E. Parisotto, H. F. Song, J. W. Rae, R. Pascanu, C. Gulcehre, S. M. Jayakumar, M. Jaderberg, R. L. Kaufman, A. Clark, S. Noury, M. M. Botvinick, N. Heess, and R. Hadsell, “Stabilizing Transformers for Reinforcement Learning,” (2019). ArXiv:1910.06764 [cs].
[6] M. Doherty, “XLRON: Accelerated Learning and Resource Allocation for Optical Networks,” https://github.com/micdoh/XLRON.git (2023).
[7] M. Doherty and A. Beghelli, “XLRON: Accelerated Reinforcement Learning Environments for Optical Networks,” in 2024 Optical Fiber Communications Conference and Exhibition (OFC), (2024), pp. 1–3.
[8] M. Hessel, M. Kroiss, A. Clark, I. Kemaev, J. Quan, T. Keck, F. Viola, and H. van Hasselt, “Podracer architectures for scalable Reinforcement Learning,” (2021). ArXiv:2104.06272 [cs].
[9] I. Reid, A. Sehanobish, C. Höfs, B. Mlodozeniec, L. Vulpius, F. Barbero, A. Weller, K. Choromanski, R. E. Turner, and P. Veličković, “Rotary Position Encodings for Graphs,” (2026). ArXiv:2509.22259 [cs].
[10] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T.-Y. Liu, “On Layer Normalization in the Transformer Architecture,” (2020). ArXiv:2002.04745 [cs].
[11] X. Chen, B. Li, R. Proietti, H. Lu, Z. Zhu, and S. J. B. Yoo, “DeepRMSA: A Deep Reinforcement Learning Framework for Routing, Modulation and Spectrum Assignment in Elastic Optical Networks,” J. Light. Technol. 37, 4155–4163 (2019).
[12] B. Tang, Y.-C. Huang, Y. Xue, and W. Zhou, “Heuristic Reward Design for Deep Reinforcement Learning-Based Routing, Modulation and Spectrum Assignment of Elastic Optical Networks,” IEEE Commun. Lett. 26, 2675–2679 (2022).
[13] L. Xu, Y.-C. Huang, Y. Xue, and X. Hu, “Deep Reinforcement Learning-Based Routing and Spectrum Assignment of EONs by Exploiting GCN and RNN for Feature Extraction,” J. Light. Technol. 40, 4945–4955 (2022).
[14] M. Shimoda and T. Tanaka, “Mask RSA: End-To-End Reinforcement Learning-based Routing and Spectrum Assignment in Elastic Optical Networks,” in 2021 European Conference on Optical Communication (ECOC), (IEEE, Bordeaux, France, 2021), pp. 1–4.
[15] Y. Cheng, S. Ding, Y. Shao, and C.-K. Chan, “PtrNet-RSA: A Pointer Network-based QoT-aware Routing and Spectrum Assignment Scheme in Elastic Optical Networks,” J. Light. Technol. pp. 1–12 (2024).
[16] R. Matzner, A. Ahuja, R. Sadeghi, M. Doherty, A. Beghelli, S. J. Savory, and P. Bayvel, “Topology Bench: Systematic Graph Based Benchmarking for Core Optical Networks,” (2024). Version Number: 1.
[17] P. Garcia, A. Zsigri, and A. Guitton, “A multicast reinforcement learning algorithm for WDM optical networks,” in Proceedings of the 7th International Conference on Telecommunications, 2003. ConTEL 2003., (IEEE, Zagreb, Croatia, 2003), pp. 419–426 vol.2.
[18] S. Huang and S. Ontañón, “Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games,” (2020). ArXiv:2010.03956 [cs, stat].
[19] Y. Teng, C. Natalino, H. Li, R. Yang, J. Majeed, S. Shen, P. Monti, R. Nejabati, S. Yan, and D. Simeonidou, “Deep-reinforcement-learning-based RMSCA for space division multiplexing networks with multi-core fibers [Invited Tutorial],” J. Opt. Commun. Netw. 16, C76 (2024).
[20] Y. Teng, C. Natalino, F. Arpanaei, H. Li, A. Sánchez-Macián, P. Monti, S. Yan, and D. Simeonidou, “DRL-Assisted QoT-Aware Service Provisioning in Multi-Band Elastic Optical Networks,” J. Light. Technol. 43, 9090–9101 (2025).
[21] H. Wang, Y. Wang, Y. Zhao, and J. Zhang, “Physical layer-aware deep reinforcement learning with advantage function stabilization for dynamic RMSA in elastic optical networks,” J. Opt. Commun. Netw. 18, 250 (2026).
[22] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” (2017). ArXiv:1707.06347 [cs].
[23] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-Dimensional Continuous Control Using Generalized Advantage Estimation,” (2018). ArXiv:1506.02438 [cs].
[24] A. Naik, Y. Wan, M. Tomar, and R. S. Sutton, “Reward Centering,” (2024). ArXiv:2405.09999 [cs].
[25] R. Zabounidis, R. Siegelmann, M. Qadri, W. Kim, S. Stepputtis, and K. P. Sycara, “Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms,” (2026). ArXiv:2603.09090 [cs].
[26] Y. Hou, X. Liang, J. Zhang, Q. Yang, A. Yang, and N. Wang, “Exploring the Use of Invalid Action Masking in Reinforcement Learning: A Comparative Study of On-Policy and Off-Policy Algorithms in Real-Time Strategy Games,” Appl. Sci. 13, 8283 (2023).
[27] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision Transformer: Reinforcement Learning via Sequence Modeling,” (2021). ArXiv:2106.01345 [cs].
[28] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” (2016). ArXiv:1607.06450 [cs, stat].
[29] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph Attention Networks,” (2018). ArXiv:1710.10903 [stat].
[30] S. Brody, U. Alon, and E. Yahav, “How Attentive are Graph Attention Networks?” (2022). ArXiv:2105.14491 [cs].
[31]

Graph Attention Network En- hanced Deep Reinforcement Learning Framework for Routing, Mod- ulation, and Spectrum Allocation in EONs,

Z. Xiong, Y .-C. Huang, and X. Hu, “Graph Attention Network En- hanced Deep Reinforcement Learning Framework for Routing, Mod- ulation, and Spectrum Allocation in EONs, ” in 2024 Asia Communica- tions and Photonics Conference (ACP) and International Conference on Information Photonics and Optical Communications (IPOC), (IEEE, Beijing, China, 2024), pp. 1–6

work page 2024
[32]

Do transformers really perform bad for graph representation?, 2021

C. Ying, T . Cai, S. Luo, S. Zheng, G. Ke, D. He, Y . Shen, and T .-Y . Liu, “Do T ransformers Really Perform Bad for Graph Representation?” (2021). ArXiv:2106.05234 [cs]

[33] L. Ma, C. Lin, D. Lim, A. Romero-Soriano, P. K. Dokania, M. Coates, P. Torr, and S.-N. Lim, “Graph Inductive Biases in Transformers without Message Passing,” (2023). arXiv:2305.17589 [cs]

[34] M. Black, Z. Wan, G. Mishne, A. Nayyeri, and Y. Wang, “Comparing Graph Transformers via Positional Encodings,” (2024). arXiv:2402.14202 [cs]

[35] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, “RoFormer: Enhanced Transformer with Rotary Position Embedding,” (2023). arXiv:2104.09864 [cs]

[36] S. Ennadir, M. Vazirgiannis, and R. Liao, “Pool me wisely: Rethinking graph pooling in graph transformers,” (2025). arXiv:2502.11032

[37] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer Networks,” (2015). Version Number: 2

[38] K. Hayashi, Y. Mori, and H. Hasegawa, “Cost-effective network capacity upgrade by heterogeneous wavelength division multiplexing density with bandwidth-variable virtual direct links,” J. Opt. Commun. Netw. 15, D23–D32 (2023)

[39] K. Cruzado, Y. Mori, S.-C. Lin, M. Matsuura, S. Subramaniam, and H. Hasegawa, “Effective Capacity Estimation Based on Cut-Set Load Analysis in Optical Path Networks,” in 2023 International Conference on Photonics in Switching and Computing (PSC), (2023), pp. 1–3

[40] K. Cruzado, Y. Mori, S.-C. Lin, M. Matsuura, S. Subramaniam, and H. Hasegawa, “Capacity-Bound Evaluation and Routing and Spectrum Assignment for Elastic Optical Path Networks with Distance-Adaptive Modulation,” in 2024 Optical Fiber Communications Conference and Exhibition (OFC), (2024), pp. 1–3

[41] S. Baroni, “Routing and wavelength allocation in WDM optical networks,” Ph.D. thesis, University College London, United Kingdom (1998)

[42] A. Beghelli, “Resource allocation and scalability in dynamic wavelength-routed optical networks,” Ph.D. thesis, University of London (2006)

[43] S. Bharthulwar, S. Tao, and H. Su, “Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning,” (2025). arXiv:2511.21011 [cs]

[44] S. Nallaperuma, Z. Gan, J. Nevin, M. Shevchenko, and S. J. Savory, “Interpreting multi-objective reinforcement learning for routing and wavelength assignment in optical networks,” J. Opt. Commun. Netw. 15, 497 (2023)
