pith. sign in

arxiv: 2606.12816 · v3 · pith:CFJHE2TQnew · submitted 2026-06-11 · 🪐 quant-ph · cs.ET· cs.LG

Graph Reinforcement Learning for Calibration-Aware Quantum Circuit Routing

Pith reviewed 2026-06-27 07:02 UTC · model grok-4.3

classification 🪐 quant-ph cs.ETcs.LG
keywords quantum circuit routingreinforcement learningcalibration-aware compilationquantum fidelityIBM Heron processorgraph reinforcement learningproximal policy optimization
0
0 comments X

The pith

Calibration-aware graph reinforcement learning routing improves quantum circuit fidelity over gate-count methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a reinforcement learning policy can select better SWAP routes for quantum circuits by incorporating same-day calibration measurements from the hardware instead of optimizing only for the number of gates. It trains the policy using proximal policy optimization on a graph of the processor and measures success through exact simulated fidelity on nine benchmark circuits across three calibration snapshots. A sympathetic reader would care because routing decisions directly determine how much noise a circuit accumulates on real devices; if calibration-aware routing works, it offers a software-only way to raise the effective quality of existing noisy processors without new hardware.

Core claim

The paper claims that its calibration-aware graph RL router reaches a pooled mean exact fidelity of 0.727 on the evaluated circuits, exceeding the 0.440 of SABRE-best20 and the 0.481 of target-aware SABRE, and concludes that learned routing informed by calibration data can improve fidelity beyond what gate-count-driven compilation achieves.

What carries the argument

A proximal-policy-optimization graph RL policy that selects hardware-edge SWAPs using same-day calibration data as input.

If this is right

  • Fidelity gains appear mainly in the 5-qubit and 8-qubit circuit families.
  • The RL routes use more two-qubit gates yet still deliver higher fidelity.
  • All 10-qubit families continue to favor SABRE-best20 when the tree action graph is held fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Daily calibration snapshots can be fed directly into routing policies to avoid poorly performing couplers without changing the circuit itself.
  • The observed size dependence suggests that expanding the action graph could extend gains to larger circuits.
  • The method could be applied to other processors whose calibration drifts on similar timescales.

Load-bearing premise

The fixed tree action graph chosen for the RL policy captures enough useful routing decisions for circuits of all tested sizes.

What would settle it

A head-to-head fidelity comparison on the same 10-qubit circuit families and calibration snapshots in which the RL router still underperforms SABRE-best20 would falsify the claim that calibration-aware learned routing improves fidelity beyond gate-count methods.

Figures

Figures reproduced from arXiv: 2606.12816 by Dheeraj Peddireddy, Vaneet Aggarwal, Yash Vardhan Tomar.

Figure 1
Figure 1. Figure 1: Routing-state graph construction. (a) Circuit inputs consist of the remaining circuit, front blocking gate 𝑔𝑡 , and lookahead gates 𝐺𝑡 . (b) Current non-identity placement 𝑀𝑡 : 𝐿 → 𝑃 and calibration snapshot 𝜅 are encoded on 𝐺𝐵 = (𝑃, 𝐸, 𝜅); node labels show logical occupants, and red rings mark the blocked front-gate operands. A solid 𝑝3–𝑝4 edge marks a legal SWAP edge on the shortest path. (c) Message pas… view at source ↗
Figure 2
Figure 2. Figure 2: Benchmark exact fidelity and routed two-qubit (2Q) counts. (a,b) Snapshot means with matched cells and bootstrap intervals. (c,d) Circuit-family means. Fidelity gains coincide with additional 2Q gates on families where calibration-aware routing helps. At the family level, 5q and 8q circuits improve when the router spends extra 2Q gates to avoid less reliable couplers, while 10q circuits add overhead with l… view at source ↗
read the original abstract

Quantum circuit routing is a key step in compiling programs for noisy intermediate-scale quantum processors. Routes that appear efficient by standard overhead metrics can still lose fidelity when they pass through poorly calibrated couplers. We study a calibration-aware graph reinforcement-learning router that uses same-day IBM Heron r2 calibration data to choose hardware-edge SWAPs. We train the policy with proximal policy optimization and evaluate it with exact simulated fidelity across nine Munich Quantum Toolkit (MQT) Bench circuits and three calibration snapshots. Across these evaluations, pooled mean exact fidelity is $0.727$, compared with $0.440$ for SABRE-best20 and $0.481$ for target-aware SABRE. We observed that fidelity gains came with higher routed two-qubit counts and were concentrated in 5 qubit and 8 qubit circuit families; under the fixed tree action graph, all 10 qubit families favored SABRE-best20. Overall, our results show that calibration-aware learned routing can improve fidelity beyond gate-count-driven compilation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a calibration-aware graph RL router, trained with PPO on same-day IBM Heron r2 calibration snapshots, achieves a pooled mean exact simulated fidelity of 0.727 across nine MQT Bench circuits and three snapshots, outperforming SABRE-best20 (0.440) and target-aware SABRE (0.481). Fidelity gains occur with higher routed two-qubit counts and are concentrated in 5- and 8-qubit families; under the fixed tree action graph all 10-qubit families revert to preferring SABRE-best20. The work concludes that calibration-aware learned routing can improve fidelity beyond gate-count-driven compilation.

Significance. If the empirical comparison holds, the result provides concrete evidence that incorporating real calibration data into an RL policy can yield higher exact fidelity than standard SABRE variants on smaller circuits. Credit is due for the use of external calibration snapshots rather than synthetic noise models and for reporting exact simulated fidelity instead of proxy metrics such as gate count. The explicit acknowledgment of the 10-qubit limitation is also a strength.

major comments (2)
  1. [Abstract] Abstract: The central claim that calibration-aware learned routing improves fidelity beyond gate-count-driven compilation is load-bearing on the policy being able to select useful SWAPs inside the chosen fixed tree action graph. The abstract itself states that all 10-qubit families favored SABRE-best20 under this graph, indicating that the action-space restriction—not the calibration signal or PPO training—is the factor preventing competitive routes at larger sizes and thereby limiting the generality of the reported 0.727 pooled mean fidelity.
  2. [Evaluation] Evaluation section (inferred from abstract and results): The reported fidelity numbers rely on simulated exact fidelity whose match to actual hardware execution is unverified in the manuscript; without this validation or at least a hardware run on a subset of circuits, the practical significance of the 0.727 vs. 0.440/0.481 comparison remains provisional.
minor comments (2)
  1. [Methods] The manuscript should clarify the precise construction and size of the fixed tree action graph and justify why this particular graph was selected rather than a denser or hardware-native graph.
  2. [Methods] Training hyperparameters, network architecture, and reward function for the PPO policy are not described in sufficient detail to allow reproduction of the reported fidelity numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the strengths of our approach in using real calibration data and exact fidelity metrics. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that calibration-aware learned routing improves fidelity beyond gate-count-driven compilation is load-bearing on the policy being able to select useful SWAPs inside the chosen fixed tree action graph. The abstract itself states that all 10-qubit families favored SABRE-best20 under this graph, indicating that the action-space restriction—not the calibration signal or PPO training—is the factor preventing competitive routes at larger sizes and thereby limiting the generality of the reported 0.727 pooled mean fidelity.

    Authors: We agree that the fixed tree action graph restricts the policy's options, particularly at 10 qubits where SABRE-best20 is preferred. This choice was necessary to keep the action space manageable for PPO training. Nevertheless, the calibration-aware policy achieves higher fidelity than the baselines for the 5- and 8-qubit circuits, demonstrating that the calibration signal enables better SWAP selections within the available actions. We will revise the abstract to clarify that the reported improvements are achieved under this action space constraint. revision: partial

  2. Referee: [Evaluation] Evaluation section (inferred from abstract and results): The reported fidelity numbers rely on simulated exact fidelity whose match to actual hardware execution is unverified in the manuscript; without this validation or at least a hardware run on a subset of circuits, the practical significance of the 0.727 vs. 0.440/0.481 comparison remains provisional.

    Authors: The manuscript reports exact simulated fidelity computed from the calibration data, which provides a direct estimate of expected performance. We acknowledge that actual hardware execution would provide stronger validation of the practical impact. However, obtaining consistent hardware runs across multiple calibration snapshots poses logistical challenges, and the current evaluation focuses on simulation to isolate the effect of the routing policy. revision: no

Circularity Check

0 steps flagged

No circularity: empirical comparison of RL router against external baselines

full rationale

The paper reports an empirical evaluation of a PPO-trained graph RL policy for quantum circuit routing, using same-day IBM calibration snapshots and exact simulated fidelity on MQT Bench circuits. The central result (pooled mean fidelity 0.727 vs. 0.440/0.481 for SABRE variants) is obtained by direct measurement on held-out circuits and external hardware data; no equations, fitted parameters, or self-citations are invoked to derive the fidelity numbers from the policy itself. The fixed-tree action graph is an explicit modeling choice whose limitations are acknowledged in the abstract, but this does not create a self-referential derivation. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard RL convergence assumptions and the domain assumption that simulated fidelity is a useful proxy; no new entities are postulated and no free parameters are reported in the abstract.

axioms (2)
  • domain assumption Proximal policy optimization produces a policy that generalizes across calibration snapshots for the routing task.
    Invoked implicitly by training the policy and evaluating on held-out snapshots.
  • domain assumption Exact simulated fidelity on the chosen noise model accurately ranks routing quality for real IBM Heron devices.
    Used to declare the 0.727 result superior to baselines.

pith-pipeline@v0.9.1-grok · 5711 in / 1371 out tokens · 27705 ms · 2026-06-27T07:02:12.294393+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 2 linked inside Pith

  1. [1]

    Quantum Computing in the NISQ Era and Beyond,

    J. Preskill, “Quantum Computing in the NISQ Era and Beyond,”Quantum, vol. 2, art. 79, 2018

  2. [2]

    Qubit Allocation,

    M. Y. Siraichiet al., “Qubit Allocation,”Proc. ACM Program. Lang., vol. 3, OOPSLA, 2019

  3. [3]

    Mapping Quantum Circuits to IBM QX Architectures,

    A. Zulehneret al., “Mapping Quantum Circuits to IBM QX Architectures,”IEEE TCAD, vol. 38, no. 7, pp. 1226–1236, 2019

  4. [4]

    On the Qubit Routing Problem,

    A. Cowtanet al., “On the Qubit Routing Problem,” inProc. TQC, 2019

  5. [5]

    Tackling the Qubit Mapping Problem for NISQ-Era Quantum Devices,

    G. Li, Y. Ding, and Y. Xie, “Tackling the Qubit Mapping Problem for NISQ-Era Quantum Devices,” inProc. ASPLOS, 2019

  6. [6]

    Qiskit:AnOpen-sourceFrameworkforQuantumComputing,

    Qiskitcontributors,“Qiskit:AnOpen-sourceFrameworkforQuantumComputing,” version 2.4.1, 2026

  7. [7]

    Noise-Adaptive Compiler Mappings for NISQ Computers,

    P. Muraliet al., “Noise-Adaptive Compiler Mappings for NISQ Computers,” in Proc. ASPLOS, 2019

  8. [8]

    ExtractingSuccessfromIBM’s20-QubitMachines,

    S.Nishioetal.,“ExtractingSuccessfromIBM’s20-QubitMachines,”ACMJETC, vol. 16, no. 3, 2020

  9. [9]

    EnsembleofDiverseMappings,

    S.S.TannuandM.K.Qureshi,“EnsembleofDiverseMappings,”inProc.MICRO, 2019

  10. [10]

    MQT Bench,

    N. Quetschlichet al., “MQT Bench,”Quantum, vol. 7, art. 1062, 2023

  11. [11]

    FIDDLE: Reinforcement Learning for Quantum Fidelity Enhancement,

    H. M. Ngo, T. Kahveci, and M. T. Thai, “FIDDLE: Reinforcement Learning for Quantum Fidelity Enhancement,”ACM Trans. Quantum Comput., vol. 7, no. 1, art. 7, 2026

  12. [12]

    ProximalPolicyOptimizationAlgorithms,

    J.Schulmanetal.,“ProximalPolicyOptimizationAlgorithms,”arXiv:1707.06347, 2017

  13. [13]

    NeuralMessagePassingforQuantumChemistry,

    J.Gilmeretal.,“NeuralMessagePassingforQuantumChemistry,”inProc.ICML, 2017

  14. [14]

    Semi-Supervised Classification with Graph Convolutional Networks,

    T. N. Kipf and M. Welling, “Semi-Supervised Classification with Graph Convolutional Networks,” inProc. ICLR, 2017

  15. [15]

    Reinforcement Learning for Qubit Routing,

    M. G. Pozziet al., “Reinforcement Learning for Qubit Routing,”ACM Trans. Quantum Comput., vol. 3, no. 2, 2022

  16. [16]

    Qubit Routing Using GNN-Aided MCTS,

    A. Sinha, U. Azad, and H. Singh, “Qubit Routing Using GNN-Aided MCTS,” in Proc. AAAI, 2022

  17. [17]

    DeepRLStrategiesforNoise-Adaptive Qubit Routing,

    G.Pascoal,J.P.Fernandes,andR.Abreu,“DeepRLStrategiesforNoise-Adaptive Qubit Routing,” inProc. IEEE QSW, pp. 146–156, 2024

  18. [18]

    AlphaRouter,

    W. Tanget al., “AlphaRouter,” arXiv:2410.05115, 2024

  19. [19]

    Noise-Adaptive Mapping with GNNs,

    V. Saravanan and S. M. Saeed, “Noise-Adaptive Mapping with GNNs,”IEEE TCAD, 2024

  20. [20]

    H. T. Nguyenet al., “QFOR,” arXiv:2508.04974, 2025

  21. [21]

    Improving and Benchmarking NISQ Qubit Routers,

    V. Pina-Canelles, A. Auer, and I. de Vega, “Improving and Benchmarking NISQ Qubit Routers,” arXiv:2502.03908, 2025

  22. [22]

    Qubit Mapping and Routing Tailored to Advanced Quantum ISAs: Not as Costly as You Think,

    Z. Yanget al., “Qubit Mapping and Routing Tailored to Advanced Quantum ISAs: Not as Costly as You Think,” arXiv:2511.04608, 2025