Graph Reinforcement Learning for Calibration-Aware Quantum Circuit Routing

Dheeraj Peddireddy; Vaneet Aggarwal; Yash Vardhan Tomar

arXiv: 2606.12816 · v3 · pith:CFJHE2TQ · submitted 2026-06-11 · 🪐 quant-ph · cs.ET · cs.LG


Pith reviewed 2026-06-27 07:02 UTC · model grok-4.3

classification 🪐 quant-ph · cs.ET · cs.LG

keywords quantum circuit routing · reinforcement learning · calibration-aware compilation · quantum fidelity · IBM Heron processor · graph reinforcement learning · proximal policy optimization


The pith

Calibration-aware graph reinforcement learning routing improves quantum circuit fidelity over gate-count methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a reinforcement learning policy can select better SWAP routes for quantum circuits by incorporating same-day calibration measurements from the hardware instead of optimizing only for the number of gates. It trains the policy using proximal policy optimization on a graph of the processor and measures success through exact simulated fidelity on nine benchmark circuits across three calibration snapshots. A sympathetic reader would care because routing decisions directly determine how much noise a circuit accumulates on real devices; if calibration-aware routing works, it offers a software-only way to raise the effective quality of existing noisy processors without new hardware.

Core claim

The paper claims that its calibration-aware graph RL router reaches a pooled mean exact fidelity of 0.727 on the evaluated circuits, exceeding the 0.440 of SABRE-best20 and the 0.481 of target-aware SABRE, and concludes that learned routing informed by calibration data can improve fidelity beyond what gate-count-driven compilation achieves.
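The headline comparison is simple arithmetic on the three reported pooled means. The sketch below assumes, since this summary does not say, that "pooled" means an unweighted average over the 9 circuits × 3 snapshots = 27 matched cells; the helper names are illustrative only.

```python
# Pooled mean exact fidelities as reported in the abstract.
REPORTED = {"graph_rl": 0.727, "sabre_best20": 0.440, "target_aware_sabre": 0.481}

# Assumption: 9 MQT Bench circuits x 3 calibration snapshots, equally weighted.
N_CELLS = 9 * 3


def pooled_mean(cell_fidelities):
    """Unweighted mean over (circuit, snapshot) cells; equal weighting is an assumption."""
    if len(cell_fidelities) != N_CELLS:
        raise ValueError(f"expected {N_CELLS} cells, got {len(cell_fidelities)}")
    return sum(cell_fidelities) / len(cell_fidelities)


def gap(ours, baseline):
    """Absolute difference and ratio between two pooled means."""
    return ours - baseline, ours / baseline


for name in ("sabre_best20", "target_aware_sabre"):
    diff, ratio = gap(REPORTED["graph_rl"], REPORTED[name])
    print(f"{name}: +{diff:.3f} absolute, {ratio:.2f}x relative")
# sabre_best20: +0.287 absolute, 1.65x relative
# target_aware_sabre: +0.246 absolute, 1.51x relative
```

A pooled mean hides per-family structure, which is why the family-level breakdown below matters as much as the headline ratio.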

What carries the argument

A proximal-policy-optimization graph RL policy that selects hardware-edge SWAPs using same-day calibration data as input.
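For readers less familiar with PPO, the core of the training update is the clipped surrogate objective of Schulman et al. (2017). The minimal NumPy sketch below shows only that loss term. The clip range of 0.2 is the common default, not a value reported for this router, and any value-function or entropy terms the authors may use are omitted.

```python
import numpy as np


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate objective, averaged over sampled actions.

    logp_new / logp_old: log-probabilities of the chosen SWAP actions under the
    current and behaviour policies; advantages: estimated advantages for them.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Pessimistic bound: take the smaller objective, then negate to get a loss.
    return -float(np.mean(np.minimum(unclipped, clipped)))


# A SWAP whose probability rose from 0.5 to 0.8 (probability ratio 1.6):
good = ppo_clip_loss(np.log([0.8]), np.log([0.5]), [1.0])   # gain capped at ratio 1.2
bad = ppo_clip_loss(np.log([0.8]), np.log([0.5]), [-1.0])   # penalty is not capped
print(good, bad)
```

The asymmetry is the point of the design: a policy update cannot profit from moving far from the old policy, but it is fully penalized when such a move turns out to be harmful.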

If this is right

Fidelity gains appear mainly in the 5-qubit and 8-qubit circuit families.
The RL routes use more two-qubit gates yet still deliver higher fidelity.
All 10-qubit families continue to favor SABRE-best20 when the tree action graph is held fixed.
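Why can a route with more two-qubit gates still score higher? A common back-of-the-envelope proxy, the estimated success probability (the product of gate fidelities), already shows the effect. This proxy is not the exact simulated fidelity used in the paper, and the error rates below are made-up illustrative values, not Heron calibration data. A SWAP is counted as three two-qubit gates, its standard decomposition.

```python
import math

GATES_PER_SWAP = 3  # standard decomposition of SWAP into three two-qubit gates


def swap_path_esp(edge_errors):
    """Estimated success probability of a SWAP chain: product of 2Q gate fidelities.

    Ignores single-qubit errors, readout error, and decoherence.
    """
    return math.prod((1.0 - e) ** GATES_PER_SWAP for e in edge_errors)


short_route = [0.05]          # one SWAP across a poorly calibrated coupler (hypothetical)
long_route = [0.005, 0.005]   # two SWAPs across well-calibrated couplers (hypothetical)
for name, route in (("short", short_route), ("long", long_route)):
    n_2q = GATES_PER_SWAP * len(route)
    print(f"{name}: {n_2q} two-qubit gates, ESP = {swap_path_esp(route):.3f}")
# short: 3 two-qubit gates, ESP = 0.857
# long: 6 two-qubit gates, ESP = 0.970
```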

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Daily calibration snapshots can be fed directly into routing policies to avoid poorly performing couplers without changing the circuit itself.
The observed size dependence suggests that expanding the action graph could extend gains to larger circuits.
The method could be applied to other processors whose calibration drifts on similar timescales.
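The idea of steering around poor couplers using a daily snapshot can be illustrated with a classical stand-in: a shortest-path search whose edge weights are the negative log-fidelity of a SWAP on each coupler. This is a hedged sketch of a heuristic, not the paper's learned policy. The 5-qubit coupling map and error rates are hypothetical, and the cost treats every edge on the path as one SWAP, which simplifies real routing.

```python
import heapq
import math


def swap_cost(two_qubit_error, gates_per_swap=3):
    """Negative log-fidelity of one SWAP (three two-qubit gates) on a coupler."""
    return -gates_per_swap * math.log(1.0 - two_qubit_error)


def calibration_aware_path(coupler_errors, src, dst):
    """Dijkstra over the coupling graph with -log(fidelity) edge weights.

    coupler_errors maps an undirected edge (u, v) to its two-qubit error rate.
    Assumes dst is reachable from src. Returns (path, estimated path fidelity).
    """
    adj = {}
    for (u, v), err in coupler_errors.items():
        w = swap_cost(err)
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist.get(v, math.inf):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], math.exp(-dist[dst])


# Hypothetical snapshot: the hop-shortest route 0-1-2 crosses a poor coupler (1, 2).
snapshot = {(0, 1): 0.002, (1, 2): 0.06, (0, 3): 0.003, (3, 4): 0.002, (4, 2): 0.003}
print(calibration_aware_path(snapshot, 0, 2))
```

With uniform error rates the same search falls back to the hop-shortest route, which is roughly what a gate-count objective would pick.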

Load-bearing premise

The fixed tree action graph chosen for the RL policy captures enough useful routing decisions for circuits of all tested sizes.
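The premise matters because restricting SWAPs to a tree can stretch routes. The stdlib-only sketch below builds a breadth-first spanning tree of a hypothetical 6-qubit ring and compares hop distances. The paper's actual tree construction is not described in this summary, so the BFS tree is an assumed stand-in. The one coupler left out of the tree turns a 1-hop neighbour into a 5-hop trip.

```python
from collections import deque


def adjacency(n, edges):
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def bfs_spanning_tree(n, edges, root=0):
    """Edges of a breadth-first spanning tree rooted at `root` (hypothetical tree choice)."""
    adj, seen, tree, queue = adjacency(n, edges), {root}, [], deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree.append((u, v))
                queue.append(v)
    return tree


def hop_distance(n, edges, src, dst):
    """Unweighted shortest-path length, or None if dst is unreachable."""
    adj, dist, queue = adjacency(n, edges), {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


ring = [(i, (i + 1) % 6) for i in range(6)]  # hypothetical 6-qubit ring coupling map
tree = bfs_spanning_tree(6, ring)
print(len(ring), len(tree))                                       # 6 5
print(hop_distance(6, ring, 3, 4), hop_distance(6, tree, 3, 4))   # 1 5
```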

What would settle it

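Because the abstract already reports that every 10-qubit family favors SABRE-best20 under the fixed tree action graph, the open test is a head-to-head fidelity comparison on those same 10-qubit families and calibration snapshots with an expanded action graph. If the RL router still underperforms SABRE-best20 there, the claim that calibration-aware learned routing improves fidelity beyond gate-count methods would hold only for small circuits.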

Figures

Figures reproduced from arXiv: 2606.12816 by Dheeraj Peddireddy, Vaneet Aggarwal, Yash Vardhan Tomar.

**Figure 1.** Routing-state graph construction. (a) Circuit inputs consist of the remaining circuit, front blocking gate $g_t$, and lookahead gates $G_t$. (b) Current non-identity placement $M_t: L \to P$ and calibration snapshot $\kappa$ are encoded on $G_B = (P, E, \kappa)$; node labels show logical occupants, and red rings mark the blocked front-gate operands. A solid $p_3$–$p_4$ edge marks a legal SWAP edge on the shortest path. (c) Message pas…
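The caption describes message passing over the routing-state graph $G_B = (P, E, \kappa)$. A generic layer of that kind is sketched below in NumPy: each physical qubit aggregates messages from its neighbours, with the coupler's calibration features concatenated to the neighbour's state. The feature choices, dimensions, and random weights are hypothetical; the paper's actual architecture is not given in this summary.

```python
import numpy as np


def message_passing_layer(h, edges, kappa, w_self, w_msg):
    """One message-passing round on the hardware graph G_B = (P, E, kappa).

    h:      (|P|, d_in) node features, e.g. occupancy and front-gate flags (hypothetical)
    edges:  list of undirected couplers (u, v)
    kappa:  (|E|, d_edge) calibration features per coupler, e.g. 2Q error (hypothetical)
    w_self: (d_out, d_in); w_msg: (d_out, d_in + d_edge)
    """
    agg = np.zeros((h.shape[0], w_msg.shape[0]))
    for (u, v), k in zip(edges, kappa):
        agg[v] += w_msg @ np.concatenate([h[u], k])  # message u -> v
        agg[u] += w_msg @ np.concatenate([h[v], k])  # message v -> u
    return np.maximum(0.0, h @ w_self.T + agg)  # ReLU update


rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3)]                         # 4 qubits in a line
h = np.array([[1, 0], [1, 1], [0, 0], [1, 1]], float)    # [occupied, front-gate operand]
kappa = np.array([[0.004], [0.030], [0.005]])            # hypothetical 2Q error per coupler
out = message_passing_layer(h, edges, kappa,
                            rng.normal(size=(8, 2)), rng.normal(size=(8, 3)))
print(out.shape)  # (4, 8)
```

Stacking a few such layers lets a qubit's embedding reflect couplers several hops away, which is what a policy needs to judge whether a SWAP route passes through a poor coupler downstream.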

**Figure 2.** Benchmark exact fidelity and routed two-qubit (2Q) counts. (a,b) Snapshot means with matched cells and bootstrap intervals. (c,d) Circuit-family means. Fidelity gains coincide with additional 2Q gates on families where calibration-aware routing helps. At the family level, 5q and 8q circuits improve when the router spends extra 2Q gates to avoid less reliable couplers, while 10q circuits add overhead with l…
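The figure reports bootstrap intervals over matched cells. One common way to compute such an interval is a percentile bootstrap of the mean paired difference, resampling cells with replacement. The sketch below assumes that scheme (the paper's exact resampling unit is not stated here), and the fidelity arrays are placeholder numbers, not the paper's data.

```python
import numpy as np


def paired_bootstrap_ci(ours, baseline, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference over matched cells.

    Each cell is one (circuit, snapshot) pair evaluated by both routers.
    """
    diff = np.asarray(ours, float) - np.asarray(baseline, float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))  # resample cells
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), (float(lo), float(hi))


# Placeholder values for six matched cells (illustrative only, not from the paper).
rl = [0.81, 0.77, 0.69, 0.74, 0.80, 0.66]
sabre = [0.52, 0.49, 0.61, 0.45, 0.55, 0.70]
mean_diff, (lo, hi) = paired_bootstrap_ci(rl, sabre)
print(f"mean diff {mean_diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Pairing matters here: because both routers see the same circuit and snapshot in each cell, differencing first removes the large between-circuit variation in absolute fidelity.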

Original abstract

Quantum circuit routing is a key step in compiling programs for noisy intermediate-scale quantum processors. Routes that appear efficient by standard overhead metrics can still lose fidelity when they pass through poorly calibrated couplers. We study a calibration-aware graph reinforcement-learning router that uses same-day IBM Heron r2 calibration data to choose hardware-edge SWAPs. We train the policy with proximal policy optimization and evaluate it with exact simulated fidelity across nine Munich Quantum Toolkit (MQT) Bench circuits and three calibration snapshots. Across these evaluations, pooled mean exact fidelity is $0.727$, compared with $0.440$ for SABRE-best20 and $0.481$ for target-aware SABRE. We observed that fidelity gains came with higher routed two-qubit counts and were concentrated in the 5-qubit and 8-qubit circuit families; under the fixed tree action graph, all 10-qubit families favored SABRE-best20. Overall, our results show that calibration-aware learned routing can improve fidelity beyond gate-count-driven compilation.
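"Exact simulated fidelity", as the term is usually used, means fidelity computed from a full simulation of the noisy circuit rather than estimated from sampled shots or gate-count proxies; this summary does not say which simulator or noise model the authors build from the calibration data. As a toy illustration of the quantity itself, the NumPy sketch below computes $F = \langle\psi|\rho|\psi\rangle$ for a Bell state after a two-qubit depolarizing channel $\rho \mapsto (1-p)\rho + pI/4$, where the exact answer is $F = 1 - 3p/4$.

```python
import numpy as np


def bell_state():
    """|Phi+> = (|00> + |11>) / sqrt(2) as a state vector."""
    psi = np.zeros(4, dtype=complex)
    psi[0] = psi[3] = 1 / np.sqrt(2)
    return psi


def depolarize(rho, p):
    """Depolarizing channel: mix rho with the maximally mixed state with weight p."""
    d = rho.shape[0]
    return (1 - p) * rho + p * np.eye(d) / d


def state_fidelity(psi, rho):
    """F = <psi|rho|psi> for a pure target state psi and a noisy density matrix rho."""
    return float(np.real(np.conj(psi) @ rho @ psi))


psi = bell_state()
rho = np.outer(psi, np.conj(psi))
for p in (0.0, 0.02, 0.1):
    print(p, round(state_fidelity(psi, depolarize(rho, p)), 4))
# 0.0 1.0
# 0.02 0.985
# 0.1 0.925
```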

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports a concrete fidelity lift (0.727 pooled mean, versus 0.440 and 0.481) over two SABRE variants by feeding real IBM calibration data into a graph RL router, but under its fixed tree action graph the router loses to SABRE-best20 on every 10-qubit family.


The main takeaway is that calibration-aware routing via graph RL produces a measurable fidelity improvement on the tested MQT Bench circuits when real same-day IBM Heron data is used. The pooled exact fidelity reaches 0.727 versus 0.440 for SABRE-best20 and 0.481 for the target-aware variant. Gains show up mainly in the 5- and 8-qubit families even though the RL routes use more two-qubit gates.

The work is a straightforward application of PPO with a graph action space to an existing problem. Using actual calibration snapshots instead of abstract noise models is the practical step forward, and the headline numbers, as stated in the abstract, show no obvious circularity.

The soft spot is the action-space restriction. The abstract states that every 10-qubit family still favored SABRE-best20 under the fixed tree graph. That suggests the modeling choice, rather than the calibration signal itself, is what caps performance at larger sizes. Simulated fidelity is used throughout, with no hardware validation or training details supplied, so the gap between simulated and real-device behavior remains untested.

The paper is aimed at people working on NISQ mapping and compilation who already know SABRE and want to see how calibration data changes the picture. The empirical comparison is sharp enough in the small-circuit regime that it deserves a serious referee to examine the methods section and the action-graph design.

Referee Report

2 major / 2 minor

Summary. The paper claims that a calibration-aware graph RL router, trained with PPO on same-day IBM Heron r2 calibration snapshots, achieves a pooled mean exact simulated fidelity of 0.727 across nine MQT Bench circuits and three snapshots, outperforming SABRE-best20 (0.440) and target-aware SABRE (0.481). Fidelity gains occur with higher routed two-qubit counts and are concentrated in 5- and 8-qubit families; under the fixed tree action graph all 10-qubit families favor SABRE-best20. The work concludes that calibration-aware learned routing can improve fidelity beyond gate-count-driven compilation.

Significance. If the empirical comparison holds, the result provides concrete evidence that incorporating real calibration data into an RL policy can yield higher exact fidelity than standard SABRE variants on smaller circuits. Credit is due for the use of external calibration snapshots rather than synthetic noise models and for reporting exact simulated fidelity instead of proxy metrics such as gate count. The explicit acknowledgment of the 10-qubit limitation is also a strength.

major comments (2)

[Abstract] Abstract: The central claim that calibration-aware learned routing improves fidelity beyond gate-count-driven compilation is load-bearing on the policy being able to select useful SWAPs inside the chosen fixed tree action graph. The abstract itself states that all 10-qubit families favored SABRE-best20 under this graph, indicating that the action-space restriction—not the calibration signal or PPO training—is the factor preventing competitive routes at larger sizes and thereby limiting the generality of the reported 0.727 pooled mean fidelity.
[Evaluation] Evaluation section (inferred from abstract and results): The reported fidelity numbers rely on simulated exact fidelity whose match to actual hardware execution is unverified in the manuscript; without this validation or at least a hardware run on a subset of circuits, the practical significance of the 0.727 vs. 0.440/0.481 comparison remains provisional.

minor comments (2)

[Methods] The manuscript should clarify the precise construction and size of the fixed tree action graph and justify why this particular graph was selected rather than a denser or hardware-native graph.
[Methods] Training hyperparameters, network architecture, and reward function for the PPO policy are not described in sufficient detail to allow reproduction of the reported fidelity numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the strengths of our approach in using real calibration data and exact fidelity metrics. We respond to each major comment below.


Referee: [Abstract] Abstract: The central claim that calibration-aware learned routing improves fidelity beyond gate-count-driven compilation is load-bearing on the policy being able to select useful SWAPs inside the chosen fixed tree action graph. The abstract itself states that all 10-qubit families favored SABRE-best20 under this graph, indicating that the action-space restriction—not the calibration signal or PPO training—is the factor preventing competitive routes at larger sizes and thereby limiting the generality of the reported 0.727 pooled mean fidelity.

Authors: We agree that the fixed tree action graph restricts the policy's options, particularly at 10 qubits where SABRE-best20 is preferred. This choice was necessary to keep the action space manageable for PPO training. Nevertheless, the calibration-aware policy achieves higher fidelity than the baselines for the 5- and 8-qubit circuits, demonstrating that the calibration signal enables better SWAP selections within the available actions. We will revise the abstract to clarify that the reported improvements are achieved under this action space constraint. revision: partial
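One standard way to confine a policy to a fixed action graph, as the rebuttal describes, is to mask the logits of SWAPs on edges outside that graph before the softmax. The sketch assumes a flat logit per candidate edge; whether the authors mask, prune, or parameterize actions differently is not stated in this summary.

```python
import numpy as np


def masked_swap_policy(logits, legal):
    """Softmax over candidate SWAP edges, assigning zero probability to illegal ones.

    Assumes at least one legal edge; otherwise the distribution is undefined.
    """
    logits = np.asarray(logits, float)
    legal = np.asarray(legal, bool)
    if not legal.any():
        raise ValueError("no legal SWAP edge")
    masked = np.where(legal, logits, -np.inf)
    masked -= masked.max()      # numerical stability; the max is finite
    probs = np.exp(masked)      # exp(-inf) == 0 for masked edges
    return probs / probs.sum()


# Six couplers in a hypothetical coupling map; edge 3 is not in the tree action graph.
logits = [0.5, 1.0, -0.2, 3.0, 2.0, 0.0]
in_tree = [True, True, True, False, True, True]
probs = masked_swap_policy(logits, in_tree)
print(probs.round(3))
```

Edge 3 carries the highest raw logit, yet the masked policy can never select it; this is the mechanism by which a tree action graph can rule out a SWAP that a full-graph router would use.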
Referee: [Evaluation] Evaluation section (inferred from abstract and results): The reported fidelity numbers rely on simulated exact fidelity whose match to actual hardware execution is unverified in the manuscript; without this validation or at least a hardware run on a subset of circuits, the practical significance of the 0.727 vs. 0.440/0.481 comparison remains provisional.

Authors: The manuscript reports exact simulated fidelity computed from the calibration data, which provides a direct estimate of expected performance. We acknowledge that actual hardware execution would provide stronger validation of the practical impact. However, obtaining consistent hardware runs across multiple calibration snapshots poses logistical challenges, and the current evaluation focuses on simulation to isolate the effect of the routing policy. revision: no

Circularity Check

0 steps flagged

No circularity: empirical comparison of RL router against external baselines


The paper reports an empirical evaluation of a PPO-trained graph RL policy for quantum circuit routing, using same-day IBM calibration snapshots and exact simulated fidelity on MQT Bench circuits. The central result (pooled mean fidelity 0.727 vs. 0.440/0.481 for SABRE variants) is obtained by direct measurement on held-out circuits and external hardware data; no equations, fitted parameters, or self-citations are invoked to derive the fidelity numbers from the policy itself. The fixed-tree action graph is an explicit modeling choice whose limitations are acknowledged in the abstract, but this does not create a self-referential derivation. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard RL convergence assumptions and the domain assumption that simulated fidelity is a useful proxy; no new entities are postulated and no free parameters are reported in the abstract.

axioms (2)

domain assumption Proximal policy optimization produces a policy that generalizes across calibration snapshots for the routing task.
Invoked implicitly by training the policy and evaluating on held-out snapshots.
domain assumption Exact simulated fidelity on the chosen noise model accurately ranks routing quality for real IBM Heron devices.
Used to declare the 0.727 result superior to baselines.

pith-pipeline@v0.9.1-grok · 5711 in / 1371 out tokens · 27705 ms · 2026-06-27T07:02:12.294393+00:00


