Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing, Modulation and Spectrum Allocation in Elastic Optical Networks
Pith reviewed 2026-05-08 18:42 UTC · model grok-4.3
The pith
Stabilized reinforcement learning enables the first graph transformer to exceed benchmarks on large-scale dynamic routing and spectrum allocation in optical networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that a graph transformer policy, trained end-to-end with stabilized off-policy reinforcement learning inside a GPU-accelerated simulator, produces the first RL agent for dynamic RMSA that outperforms all published benchmarks, increasing the supportable traffic load by up to 13 percent overall, and by up to 4 percent at blocking probabilities below 0.1 percent, on networks of up to 143 nodes and 362 links.
What carries the argument
A graph transformer policy whose training is stabilized by rotary positional encodings on graph-structured observations, off-policy invalid action masking, and valid mass regularization, executed inside a GPU-accelerated network simulator.
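Two of the three named stabilizers are concrete enough to sketch. The following NumPy sketch (function names and the exact regularizer form are assumptions for illustration, not the paper's implementation) shows how invalid action masking removes infeasible actions from the sampling distribution, while a valid-mass term penalizes the unmasked policy for placing probability on invalid actions:

```python
import numpy as np

def masked_policy(logits, valid_mask):
    """Mask invalid actions by sending their logits to -inf, then softmax.

    The mask is applied at sampling time; the unmasked distribution is
    still available to measure how much probability mass the network
    assigns to valid actions."""
    masked = np.where(valid_mask, logits, -np.inf)
    z = masked - masked.max()          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def valid_mass(logits, valid_mask):
    """Probability mass the *unmasked* policy places on valid actions."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return p[valid_mask].sum()

def valid_mass_penalty(logits, valid_mask, coef=0.1):
    """Hypothetical regularizer pushing the unmasked policy toward valid
    actions, e.g. added to the RL loss as -coef * log(valid mass)."""
    return -coef * np.log(valid_mass(logits, valid_mask) + 1e-12)
```

With logits `[2, 0, -1, 1]` and only actions 0 and 2 valid, the masked policy renormalizes over those two actions, while the penalty stays positive as long as the unmasked network still leaks mass onto invalid actions.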
Load-bearing premise
The GPU simulator accurately reproduces the dynamics, physical impairments, and traffic patterns of real elastic optical networks so that policies learned in simulation transfer to unseen traffic and topologies.
What would settle it
Run the trained policy on a live or emulated optical network using traffic traces recorded from an operational network different from those used in training and measure whether the reported capacity gains and blocking rates are reproduced.
Figures
Original abstract
Reinforcement learning (RL) has been widely applied to dynamic routing, modulation and spectrum assignment (RMSA) in optical networks, yet no prior work has trained a transformer model for this task. We attribute this to the high data and compute requirements of transformers and potential training instabilities with RL. We address this gap by combining recent advances from the machine learning literature (rotary positional encodings for graph-structured data, off-policy invalid action masking, and valid mass regularization) with GPU-accelerated simulation to achieve, for the first time, stable RL training of a transformer for dynamic RMSA. We demonstrate, through systematic benchmarking against previous RL methods and heuristic algorithms, that ours is the first RL method to exceed all benchmarks, increasing the supportable traffic load by up to 13%. To demonstrate the scalability of our approach, we train on real network topologies from the TopologyBench database up to 143 nodes and 362 links, with 320 x 12.5 GHz frequency slot units per link, and 100 Gbps traffic requests. To our knowledge, these are the largest dynamic RMSA problems to which RL has been applied. We find up to 4% increased traffic load can be supported at low blocking probability (<0.1%) with our method compared to the best available benchmark algorithm. We present an ablation study of the components of our training algorithm, the dynamics of the loss function during training, and analyze the allocation decisions of the trained models. We make all code used to produce this paper openly available for reproduction and future benchmarking: https://github.com/micdoh/XLRON.
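For readers outside the EON literature, the core constraint the learned policy must satisfy is that each request needs a block of contiguous frequency slots that is free on every link of its route. A minimal first-fit sketch (a standard heuristic baseline in this literature, not the paper's learned policy) illustrates both the contiguity and spectrum-continuity constraints:

```python
import numpy as np

def first_fit(link_slots, path_links, n_slots_needed):
    """Return the lowest starting slot index at which `n_slots_needed`
    contiguous slots are free on every link of the path (spectrum
    continuity + contiguity), or None if the request is blocked.

    link_slots: dict mapping link id -> boolean array (True = occupied).
    """
    # A slot is usable only if free on all links of the path.
    occupied = np.zeros_like(link_slots[path_links[0]], dtype=bool)
    for link in path_links:
        occupied |= link_slots[link]
    n = len(occupied)
    for start in range(n - n_slots_needed + 1):
        if not occupied[start:start + n_slots_needed].any():
            return start
    return None  # blocked
```

For example, with 8-slot links where link A occupies slots {0, 1} and link B occupies slot {3}, a 2-slot request over the path A-B cannot start at 0-3 and lands at slot 4.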
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a graph transformer architecture trained via stabilized reinforcement learning (using rotary positional encodings, off-policy invalid action masking, and valid mass regularization) for dynamic routing, modulation, and spectrum allocation (RMSA) in elastic optical networks. It claims to be the first RL method to successfully train such a transformer, achieving stable training at scale on real topologies up to 143 nodes and 362 links with 320 frequency slots per link. Systematic benchmarking against prior RL methods and heuristics shows up to 13% higher supportable traffic load (and 4% at blocking probability <0.1%), supported by ablation studies, loss dynamics, decision analysis, and fully open code.
Significance. If the results hold, the work is significant for demonstrating that graph transformers can be stably trained with RL for large-scale dynamic network optimization problems previously limited by data/compute requirements and instability. The open code, ablation study, and analysis of allocation decisions provide concrete strengths for reproducibility and extension. The scalability to 143-node instances from TopologyBench is a notable advance over prior RL applications in this domain.
major comments (2)
- [Abstract and Evaluation sections] The headline performance claims (up to 13% increase in supportable traffic load) rest entirely on results from the authors' custom GPU-accelerated simulator. No hardware-in-the-loop measurements, validation against independent EON emulators, or external traffic traces are reported, which is load-bearing because any simulator-specific artifacts in modeling nonlinearities, ASE, crosstalk, or 12.5 GHz slot constraints could inflate the reported advantage over benchmarks.
- [Evaluation and Generalization tests] While held-out traffic and TopologyBench topologies are used for testing, the training distribution (uniform requests, fixed load ranges) may still overlap substantially with test conditions, leaving open whether the transformer policy learns robust principles or exploits simulator artifacts. This directly affects the generalization claim for unseen traffic and topologies.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly quantify the hyper-parameter sensitivity of the RL training (learning rate, entropy coefficient, etc.) and how the stabilization techniques mitigate it, to better contextualize the training stability contribution.
- [Results figures and tables] Figure captions and table legends should clarify whether reported blocking probabilities and load values are averaged over multiple random seeds or single runs, given the known variance in RL policies.
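The seed-averaging point can be made concrete. A minimal sketch (a hypothetical reporting helper, not code from the paper) of what the requested clarification amounts to: mean blocking probability over independent seeds with a normal-approximation 95% confidence interval:

```python
import math
import statistics

def blocking_summary(blocked_counts, total_requests):
    """Mean blocking probability across seeded runs, with a
    normal-approximation 95% confidence interval.

    blocked_counts[i] is the number of blocked requests out of
    `total_requests` in run i (one run per random seed)."""
    probs = [b / total_requests for b in blocked_counts]
    mean = statistics.mean(probs)
    if len(probs) > 1:
        sem = statistics.stdev(probs) / math.sqrt(len(probs))
    else:
        sem = 0.0  # single run: no spread to report
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)
```

Reporting the interval alongside the mean would let readers judge whether the 13% and 4% gains exceed seed-to-seed variance.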
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review of our manuscript. We address each major comment point-by-point below, providing honest responses based on the scope and limitations of the work. Where appropriate, we indicate revisions that will be incorporated into the next version of the manuscript.
Point-by-point responses
-
Referee: [Abstract and Evaluation sections] The headline performance claims (up to 13% increase in supportable traffic load) rest entirely on results from the authors' custom GPU-accelerated simulator. No hardware-in-the-loop measurements, validation against independent EON emulators, or external traffic traces are reported, which is load-bearing because any simulator-specific artifacts in modeling nonlinearities, ASE, crosstalk, or 12.5 GHz slot constraints could inflate the reported advantage over benchmarks.
Authors: We acknowledge that all reported performance results, including the up to 13% improvement in supportable traffic load, are obtained exclusively from our custom GPU-accelerated simulator. This simulator implements standard, literature-established models for key optical impairments such as nonlinear effects, ASE noise, crosstalk, and the 12.5 GHz frequency slot constraints. We agree that the lack of hardware-in-the-loop validation or cross-checks against independent EON emulators constitutes a genuine limitation, as we do not have access to physical testbeds. To address the concern directly, we have open-sourced the full simulator code (as stated in the manuscript) to enable community verification, and we will add an explicit discussion subsection in the revised Evaluation section outlining the simulator's modeling assumptions and potential artifacts. This represents the strongest honest response within the paper's focus on algorithmic and training innovations rather than hardware experimentation. revision: partial
-
Referee: [Evaluation and Generalization tests] While held-out traffic and TopologyBench topologies are used for testing, the training distribution (uniform requests, fixed load ranges) may still overlap substantially with test conditions, leaving open whether the transformer policy learns robust principles or exploits simulator artifacts. This directly affects the generalization claim for unseen traffic and topologies.
Authors: We thank the referee for raising this important point on generalization. Our evaluation protocol uses explicitly held-out traffic request sequences and network topologies drawn from TopologyBench that are never seen during training. Training employs uniform request generation across load ranges to encourage learning of general policies, while testing measures performance on these unseen instances. Nevertheless, we recognize that the underlying request arrival model remains the same, which could in principle allow the policy to exploit simulator-specific patterns rather than purely robust principles. To strengthen the generalization claims, we will incorporate additional experiments in the revised manuscript that employ non-uniform traffic distributions and further varied held-out topologies; preliminary internal checks indicate these yield consistent advantages, supporting that the learned policy captures allocation principles beyond simulator artifacts. revision: yes
- Not addressed: hardware-in-the-loop measurements, validation against independent EON emulators, or use of external traffic traces, as the authors lack access to physical elastic optical network testbeds and equivalent resources.
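The promised non-uniform traffic experiments are not specified in the rebuttal. One common choice, shown here purely as an illustrative assumption (the authors' actual distributions may differ), is Zipf-weighted node popularity for source-destination sampling:

```python
import random

def zipf_weighted_pairs(nodes, n_requests, s=1.2, seed=0):
    """Sample source-destination pairs with Zipf-like node popularity:
    the node of rank r gets weight (r+1)**-s, so a few hub nodes
    attract most traffic. A simple stand-in for non-uniform demand."""
    rng = random.Random(seed)
    weights = [(r + 1) ** -s for r in range(len(nodes))]
    pairs = []
    while len(pairs) < n_requests:
        src, dst = rng.choices(nodes, weights=weights, k=2)
        if src != dst:  # reject self-loops
            pairs.append((src, dst))
    return pairs
```

Evaluating a policy trained on uniform requests against such skewed demand would directly test whether it learned allocation principles rather than training-distribution patterns.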
Circularity Check
No circularity: empirical RL performance measured on held-out simulator instances and public topologies.
full rationale
The paper reports measured blocking probabilities and supportable traffic loads obtained by training a graph-transformer policy with stabilized RL inside a GPU simulator and evaluating on held-out traffic and TopologyBench graphs. No derivation chain reduces a claimed performance gain to a fitted constant, self-defined quantity, or load-bearing self-citation; the central result is an empirical comparison against baselines, with code released for reproduction. Standard ML generalization risks exist but do not constitute the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
free parameters (2)
- RL hyperparameters (learning rate, discount factor, entropy coefficient)
- Transformer architecture parameters (number of layers, attention heads, embedding dimension)
axioms (2)
- domain assumption The optical network state can be represented as a graph with nodes, links, and frequency slots for sequential decision making.
- domain assumption A Markov decision process formulation captures the dynamics of traffic arrivals and blocking.
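The second axiom can be illustrated with a toy MDP: state is slot occupancy plus the pending request, an action picks a starting slot (or rejects), and reward marks acceptance. This single-link sketch is far simpler than the paper's XLRON environments and is an illustration of the formulation, not their implementation:

```python
import random

class TinyRMSAEnv:
    """Minimal MDP sketch of dynamic spectrum assignment on one link.

    State: occupancy map + pending request size. Action: starting slot
    index, or -1 to reject. Reward: 1 if allocated, 0 if blocked.
    Connections depart after a fixed holding time (a simplification of
    the exponential holding times typical in EON simulators)."""

    def __init__(self, n_slots=16, hold=5, seed=0):
        self.n_slots, self.hold = n_slots, hold
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.slots = [0] * self.n_slots        # 0 free, >0 steps remaining
        self.request = self.rng.randint(1, 4)  # slots requested
        return (tuple(self.slots), self.request)

    def step(self, start):
        # Age out existing connections (departures).
        self.slots = [max(0, s - 1) for s in self.slots]
        reward = 0
        span = range(start, start + self.request) if start >= 0 else []
        if span and start + self.request <= self.n_slots \
                and all(self.slots[i] == 0 for i in span):
            for i in span:
                self.slots[i] = self.hold
            reward = 1                          # request accepted
        self.request = self.rng.randint(1, 4)   # next arrival
        return (tuple(self.slots), self.request), reward
```

Traffic arrivals and departures drive the state transitions, so blocking emerges from the occupancy dynamics rather than being scripted, which is the essence of the Markov assumption in the ledger.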
Lean theorems connected to this paper
-
Cost/FunctionalEquation.lean (washburn_uniqueness_aczel) — relation unclear. The RS forcing chain has zero adjustable parameters; this paper is parameter-heavy by contrast, but not in conflict since the domains differ. Paper quote: “The PPO hyperparameters ... clip ϵ=0.04, discount γ=0.996, GAE λ=0.99 ... entropy coefficient (0.012–0.0225), and value function coefficient (0.01–0.1) are lightly tuned per problem setting.”
Reference graph
Works this paper leans on
-
[1]
Decomposition Models for the Routing and Slot Provisioning Problem,
B. Jaumard, A. Mohammed, and Q. A. Nguyen, “Decomposition Models for the Routing and Slot Provisioning Problem,” in 2023 International Conference on Computing, Networking and Communications (ICNC), (2023), pp. 659–665
2023
-
[2]
Reinforcement learning for dynamic resource allocation in optical networks: hype or hope?
M. Doherty, R. Matzner, R. Sadeghi, P. Bayvel, and A. Beghelli, “Reinforcement learning for dynamic resource allocation in optical networks: hype or hope?” J. Opt. Commun. Netw. 17, D1 (2025)
2025
-
[3]
Attention Is All You Need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” (2017). Version Number: 7
2017
-
[4]
Transformer-pointer DRL model for static resource allocation problems in SDM-EONs,
S. Chen, J. Wang, and M. Shigeno, “Transformer-pointer DRL model for static resource allocation problems in SDM-EONs,” J. Opt. Commun. Netw. 18, 315 (2026)
2026
-
[5]
Stabilizing Transformers for Reinforcement Learning,
E. Parisotto, H. F. Song, J. W. Rae, R. Pascanu, C. Gulcehre, S. M. Jayakumar, M. Jaderberg, R. L. Kaufman, A. Clark, S. Noury, M. M. Botvinick, N. Heess, and R. Hadsell, “Stabilizing Transformers for Reinforcement Learning,” (2019). ArXiv:1910.06764 [cs]
-
[6]
XLRON: Accelerated Learning and Resource Allocation for Optical Networks,
M. Doherty, “XLRON: Accelerated Learning and Resource Allocation for Optical Networks,” https://github.com/micdoh/XLRON.git (2023)
2023
-
[7]
XLRON: Accelerated Reinforcement Learning Environments for Optical Networks,
M. Doherty and A. Beghelli, “XLRON: Accelerated Reinforcement Learning Environments for Optical Networks, ” in 2024 Optical Fiber Communications Conference and Exhibition (OFC), (2024), pp. 1–3
2024
-
[8]
Podracer architectures for scalable Reinforcement Learning,
M. Hessel, M. Kroiss, A. Clark, I. Kemaev, J. Quan, T. Keck, F. Viola, and H. van Hasselt, “Podracer architectures for scalable Reinforcement Learning,” (2021). ArXiv:2104.06272 [cs]
-
[9]
Rotary Position Encodings for Graphs,
I. Reid, A. Sehanobish, C. Hűfs, B. Mlodozeniec, L. Vulpius, F. Barbero, A. Weller, K. Choromanski, R. E. Turner, and P. Veličković, “Rotary Position Encodings for Graphs”
-
[10]
On Layer Normalization in the Transformer Architecture,
R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T.-Y. Liu, “On Layer Normalization in the Transformer Architecture,” (2020). ArXiv:2002.04745 [cs]
-
[11]
DeepRMSA: A Deep Reinforcement Learning Framework for Routing, Modulation and Spectrum Assignment in Elastic Optical Networks,
X. Chen, B. Li, R. Proietti, H. Lu, Z. Zhu, and S. J. B. Yoo, “DeepRMSA: A Deep Reinforcement Learning Framework for Routing, Modulation and Spectrum Assignment in Elastic Optical Networks,” J. Light. Technol. 37, 4155–4163 (2019)
2019
-
[12]
Heuristic Reward Design for Deep Reinforcement Learning-Based Routing, Modulation and Spectrum Assignment of Elastic Optical Networks,
B. Tang, Y.-C. Huang, Y. Xue, and W. Zhou, “Heuristic Reward Design for Deep Reinforcement Learning-Based Routing, Modulation and Spectrum Assignment of Elastic Optical Networks,” IEEE Commun. Lett. 26, 2675–2679 (2022)
2022
-
[13]
Deep Reinforcement Learning-Based Routing and Spectrum Assignment of EONs by Exploiting GCN and RNN for Feature Extraction,
L. Xu, Y.-C. Huang, Y. Xue, and X. Hu, “Deep Reinforcement Learning-Based Routing and Spectrum Assignment of EONs by Exploiting GCN and RNN for Feature Extraction,” J. Light. Technol. 40, 4945–4955 (2022)
2022
-
[14]
Mask RSA: End-To-End Reinforcement Learning-based Routing and Spectrum Assignment in Elastic Optical Networks,
M. Shimoda and T. Tanaka, “Mask RSA: End-To-End Reinforcement Learning-based Routing and Spectrum Assignment in Elastic Optical Networks,” in 2021 European Conference on Optical Communication (ECOC), (IEEE, Bordeaux, France, 2021), pp. 1–4
2021
-
[15]
PtrNet-RSA: A Pointer Network-based QoT-aware Routing and Spectrum Assignment Scheme in Elastic Optical Networks,
Y. Cheng, S. Ding, Y. Shao, and C.-K. Chan, “PtrNet-RSA: A Pointer Network-based QoT-aware Routing and Spectrum Assignment Scheme in Elastic Optical Networks,” J. Light. Technol. pp. 1–12 (2024)
2024
-
[16]
TopologyBench: Systematic Graph Based Benchmarking for Core Optical Networks,
R. Matzner, A. Ahuja, R. Sadeghi, M. Doherty, A. Beghelli, S. J. Savory, and P. Bayvel, “TopologyBench: Systematic Graph Based Benchmarking for Core Optical Networks,” (2024). Version Number: 1
2024
-
[17]
A multicast reinforcement learning algorithm for WDM optical networks,
P. Garcia, A. Zsigri, and A. Guitton, “A multicast reinforcement learning algorithm for WDM optical networks,” in Proceedings of the 7th International Conference on Telecommunications, 2003. ConTEL 2003., (IEEE, Zagreb, Croatia, 2003), pp. 419–426 vol.2
2003
-
[18]
Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games,
S. Huang and S. Ontañón, “Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games, ” (2020). ArXiv:2010.03956 [cs, stat]
-
[19]
Deep-reinforcement-learning-based RMSCA for space division multiplexing networks with multi-core fibers [Invited Tutorial],
Y. Teng, C. Natalino, H. Li, R. Yang, J. Majeed, S. Shen, P. Monti, R. Nejabati, S. Yan, and D. Simeonidou, “Deep-reinforcement-learning-based RMSCA for space division multiplexing networks with multi-core fibers [Invited Tutorial],” J. Opt. Commun. Netw. 16, C76 (2024)
2024
-
[20]
DRL-Assisted QoT-Aware Service Provisioning in Multi-Band Elastic Optical Networks,
Y. Teng, C. Natalino, F. Arpanaei, H. Li, A. Sánchez-Macián, P. Monti, S. Yan, and D. Simeonidou, “DRL-Assisted QoT-Aware Service Provisioning in Multi-Band Elastic Optical Networks,” J. Light. Technol. 43, 9090–9101 (2025)
2025
-
[21]
Physical layer-aware deep reinforcement learning with advantage function stabilization for dynamic RMSA in elastic optical networks,
H. Wang, Y. Wang, Y. Zhao, and J. Zhang, “Physical layer-aware deep reinforcement learning with advantage function stabilization for dynamic RMSA in elastic optical networks,” J. Opt. Commun. Netw. 18, 250 (2026)
2026
-
[22]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” (2017). ArXiv:1707.06347 [cs]
2017
-
[23]
High-Dimensional Continuous Control Using Generalized Advantage Estimation,
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-Dimensional Continuous Control Using Generalized Advantage Estimation,” (2018). ArXiv:1506.02438 [cs]
2018
-
[24]
Reward Centering,
A. Naik, Y. Wan, M. Tomar, and R. S. Sutton, “Reward Centering,” (2024). ArXiv:2405.09999 [cs]
-
[25]
Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms,
R. Zabounidis, R. Siegelmann, M. Qadri, W. Kim, S. Stepputtis, and K. P. Sycara, “Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms,” (2026). ArXiv:2603.09090 [cs]
-
[26]
Exploring the Use of Invalid Action Masking in Reinforcement Learning: A Comparative Study of On-Policy and Off-Policy Algorithms in Real-Time Strategy Games,
Y. Hou, X. Liang, J. Zhang, Q. Yang, A. Yang, and N. Wang, “Exploring the Use of Invalid Action Masking in Reinforcement Learning: A Comparative Study of On-Policy and Off-Policy Algorithms in Real-Time Strategy Games,” Appl. Sci. 13, 8283 (2023)
2023
-
[27]
Decision Transformer: Reinforcement Learning via Sequence Modeling,
L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision Transformer: Reinforcement Learning via Sequence Modeling,” (2021). ArXiv:2106.01345 [cs]
-
[28]
Layer normalization,
J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” (2016). ArXiv:1607.06450 [cs, stat]
2016
-
[29]
Graph Attention Networks,
P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph Attention Networks,” (2018). ArXiv:1710.10903 [stat]
2018
-
[30]
How Attentive are Graph Attention Networks?
S. Brody, U. Alon, and E. Yahav, “How Attentive are Graph Attention Networks?” (2022). ArXiv:2105.14491 [cs]
-
[31]
Graph Attention Network Enhanced Deep Reinforcement Learning Framework for Routing, Modulation, and Spectrum Allocation in EONs,
Z. Xiong, Y.-C. Huang, and X. Hu, “Graph Attention Network Enhanced Deep Reinforcement Learning Framework for Routing, Modulation, and Spectrum Allocation in EONs,” in 2024 Asia Communications and Photonics Conference (ACP) and International Conference on Information Photonics and Optical Communications (IPOC), (IEEE, Beijing, China, 2024), pp. 1–6
2024
-
[32]
Do Transformers Really Perform Bad for Graph Representation?
C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu, “Do Transformers Really Perform Bad for Graph Representation?” (2021). ArXiv:2106.05234 [cs]
-
[33]
Graph Inductive Biases in Transformers without Message Passing,
L. Ma, C. Lin, D. Lim, A. Romero-Soriano, P. K. Dokania, M. Coates, P. Torr, and S.-N. Lim, “Graph Inductive Biases in Transformers without Message Passing,” (2023). ArXiv:2305.17589 [cs]
-
[34]
Comparing Graph Transformers via Positional Encodings,
M. Black, Z. Wan, G. Mishne, A. Nayyeri, and Y. Wang, “Comparing Graph Transformers via Positional Encodings,” (2024). ArXiv:2402.14202 [cs]
-
[35]
RoFormer: Enhanced Transformer with Rotary Position Embedding
J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, “RoFormer: Enhanced Transformer with Rotary Position Embedding,” (2023). ArXiv:2104.09864 [cs]
2023
-
[36]
Pool me wisely: Rethinking graph pooling in graph transformers,
S. Ennadir, M. Vazirgiannis, and R. Liao, “Pool me wisely: Rethinking graph pooling in graph transformers,” (2025). ArXiv:2502.11032
-
[37]
Pointer Networks,
O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer Networks,” (2015). Version Number: 2
2015
-
[38]
Cost-effective network capacity upgrade by heterogeneous wavelength division multiplexing density with bandwidth-variable virtual direct links,
K. Hayashi, Y. Mori, and H. Hasegawa, “Cost-effective network capacity upgrade by heterogeneous wavelength division multiplexing density with bandwidth-variable virtual direct links,” J. Opt. Commun. Netw. 15, D23–D32 (2023)
2023
-
[39]
Effective Capacity Estimation Based on Cut-Set Load Analysis in Optical Path Networks,
K. Cruzado, Y . Mori, S.-C. Lin, M. Matsuura, S. Subramaniam, and H. Hasegawa, “Effective Capacity Estimation Based on Cut-Set Load Analysis in Optical Path Networks, ” in 2023 International Conference on Photonics in Switching and Computing (PSC), (2023), pp. 1–3
2023
-
[40]
Capacity-Bound Evaluation and Routing and Spectrum Assignment for Elastic Optical Path Networks with Distance-Adaptive Modulation,
K. Cruzado, Y. Mori, S.-C. Lin, M. Matsuura, S. Subramaniam, and H. Hasegawa, “Capacity-Bound Evaluation and Routing and Spectrum Assignment for Elastic Optical Path Networks with Distance-Adaptive Modulation,” in 2024 Optical Fiber Communications Conference and Exhibition (OFC), (2024), pp. 1–3
2024
-
[41]
Routing and wavelength allocation in WDM optical networks,
S. Baroni, “Routing and wavelength allocation in WDM optical networks,” Ph.D. thesis, University College London, United Kingdom (1998)
1998
-
[42]
Resource allocation and scalability in dynamic wavelength-routed optical networks,
A. Beghelli, “Resource allocation and scalability in dynamic wavelength-routed optical networks,” Ph.D. thesis, University of London (2006)
2006
-
[43]
Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning,
S. Bharthulwar, S. Tao, and H. Su, “Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning,” (2025). ArXiv:2511.21011 [cs]
-
[44]
Interpreting multi-objective reinforcement learning for routing and wavelength assignment in optical networks,
S. Nallaperuma, Z. Gan, J. Nevin, M. Shevchenko, and S. J. Savory, “Interpreting multi-objective reinforcement learning for routing and wavelength assignment in optical networks,” J. Opt. Commun. Netw. 15, 497 (2023)
2023