SQARL: A Size-Agnostic Reinforcement Learning approach for Circuit Allocation in Distributed Quantum Architectures

Júlia López-Closa; Mario Martin; Víctor Carballo

arxiv: 2605.27027 · v1 · pith:EE6XD7LXnew · submitted 2026-05-26 · 💻 cs.LG

Pith reviewed 2026-06-29 19:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learning · qubit allocation · distributed quantum computing · transformer · quantum circuit optimization

The pith

Transformer-based RL learns qubit allocation policies that generalize to any circuit and hardware size without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents SQARL, a reinforcement learning framework that uses transformers to allocate qubits in distributed quantum computers. The key innovation is an architecture that processes inputs of varying sizes, allowing the same trained policy to handle different numbers of qubits and cores. Prior RL methods required retraining for each new configuration, limiting their practicality. The results demonstrate that this approach surpasses previous RL solutions and achieves lower allocation costs than the Hungarian Qubit Allocation heuristic for several test cases, including a 33% improvement on the Cuccaro Adder.

Core claim

The authors show that their size-agnostic transformer RL model learns an allocation policy that reduces inter-core communication for quantum circuits on distributed hardware. According to the paper, a single training run yields a policy that generalizes to arbitrary sizes and topologies and delivers allocation costs 25% lower on average than the Hungarian Qubit Allocation (HQA) heuristic on random circuits.

What carries the argument

A transformer-based policy network that encodes sets of qubits and cores of arbitrary cardinality to select allocation mappings in a reinforcement learning setup for the qubit allocation problem.
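
One way to see why a network built from attention over sets is not tied to a fixed input size is that its weight shapes depend only on the feature width, never on the number of qubits or cores. The following is a minimal NumPy sketch of that idea, not SQARL's actual architecture: the function names, the feature width `h`, the single attention layer, and the random embeddings are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(h, seed=0):
    """Weights depend only on the feature width h (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    return {name: rng.normal(scale=h ** -0.5, size=(h, h))
            for name in ("wq", "wk", "wv")}

def core_probabilities(qubit_emb, core_emb, params):
    """Return a [Q, C] matrix: for each qubit, a distribution over cores.

    qubit_emb has shape [Q, h] and core_emb has shape [C, h];
    Q and C are free to vary between calls.
    """
    h = qubit_emb.shape[-1]
    # Self-attention over the qubit set: each qubit attends to all others.
    q = qubit_emb @ params["wq"]
    k = qubit_emb @ params["wk"]
    v = qubit_emb @ params["wv"]
    ctx = softmax(q @ k.T / np.sqrt(h)) @ v      # [Q, h]
    # Dot-product scores between contextual qubit vectors and cores.
    logits = ctx @ core_emb.T / np.sqrt(h)       # [Q, C]
    return softmax(logits, axis=-1)

params = init_params(h=8)
rng = np.random.default_rng(1)
# The same parameters are applied to two very different problem sizes.
small = core_probabilities(rng.normal(size=(4, 8)), rng.normal(size=(2, 8)), params)
large = core_probabilities(rng.normal(size=(30, 8)), rng.normal(size=(10, 8)), params)
print(small.shape, large.shape)  # (4, 2) (30, 10)
```

The same `params` serve both calls. Whether a policy trained this way also performs well at unseen sizes is an empirical question that the architecture alone does not answer.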

If this is right

The same policy applies directly to new hardware configurations without additional training.
Learning methods can achieve allocation quality close to that of specialized heuristics like HQA.
Reduced communication costs from better allocation improve the feasibility of large-scale distributed quantum computations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This flexibility suggests that RL could become the default for dynamic quantum hardware environments where topologies change over time.
Future work might test whether the model can handle noisy intermediate-scale quantum constraints beyond just communication cost.

Load-bearing premise

The model trained on a fixed set of circuit and topology examples will perform well on unseen sizes and topologies.

What would settle it

Evaluating the policy on a circuit whose qubit count lies outside the training range, or on a hardware graph with different connectivity, and checking whether the cost exceeds the HQA baseline.
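
The test described above can be sketched as a cost comparison. The snippet assumes a simple cost model that is not taken from the paper's definitions: an allocation is a list of per-time-slice maps from logical qubit to core, and each time a qubit changes core between consecutive slices it pays the distance between the two cores. Core capacities are ignored. `allocation_cost`, `exceeds_baseline`, and the toy data are hypothetical, and the baseline allocation merely stands in for HQA output, which this sketch does not compute.

```python
from typing import Dict, List, Sequence

Allocation = List[Dict[int, int]]  # one {logical_qubit: core} map per time slice

def allocation_cost(alloc: Allocation, dist: Sequence[Sequence[float]]) -> float:
    """Sum inter-core distances over every qubit move between adjacent slices."""
    total = 0.0
    for prev, curr in zip(alloc, alloc[1:]):
        for qubit, core in curr.items():
            total += dist[prev[qubit]][core]  # contributes 0 when the qubit stays put
    return total

def exceeds_baseline(policy_alloc: Allocation, baseline_alloc: Allocation,
                     dist: Sequence[Sequence[float]]) -> bool:
    """True if the learned policy's allocation costs more than the baseline's."""
    return allocation_cost(policy_alloc, dist) > allocation_cost(baseline_alloc, dist)

# Toy hardware: 3 cores, unit cost between any two distinct cores.
dist = [[0, 1, 1],
        [1, 0, 1],
        [1, 1, 0]]

# Made-up allocations for a 3-qubit, 3-slice circuit (illustration only).
policy_alloc = [{0: 0, 1: 0, 2: 1}, {0: 0, 1: 1, 2: 1}, {0: 0, 1: 1, 2: 1}]
baseline_alloc = [{0: 0, 1: 0, 2: 1}, {0: 1, 1: 1, 2: 0}, {0: 0, 1: 0, 2: 1}]

print(allocation_cost(policy_alloc, dist))                   # 1.0
print(allocation_cost(baseline_alloc, dist))                 # 6.0
print(exceeds_baseline(policy_alloc, baseline_alloc, dist))  # False
```

Running the same comparison on circuits and hardware graphs drawn from outside the training distribution is what would turn the generalization claim into a measurable result.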

Figures

Figures reproduced from arXiv: 2605.27027 by Júlia López-Closa, Mario Martin, Víctor Carballo.

**Figure 1.** Illustrative example of circuit allocation. The x axis represents the time steps required to execute the circuit. The y axis represents each one of the actual qubits in the quantum hardware (physical qubits). The different processors in the distributed system are divided by dotted lines. The qubits in the original circuit (logical qubits) are labeled and color-coded; gates are drawn as black, vertical line…

**Figure 2.** An annotated example of a quantum circuit.

**Figure 3.** An example of circuit slicing. From Sec. 2.2 (The Problem of Qubit Allocation): Given a quantum circuit and a quantum hardware configuration, the problem of qubit allocation is to distribute the qubits across the different quantum cores for each circuit time slice in such a way that the total cost of the inter-core communications is minimized. A time-sliced quantum circuit G (refer to Sec. 4.1 for the slicing procedu…

**Figure 4.** Allocation cost benchmark between the non-learning and RL state-of-the-art methods. Hardware consists of 10 quantum cores, each with 10 qubits. The intercommunication cost between any pair of cores is one unit. From the surrounding text: No direct performance comparison in the literature exists between the non-learning and the RL state of the art; the paper presents such a benchmark.

**Figure 5.** Policy model’s architecture. From the surrounding text: The previous tensor contains features for all Q qubits. However, we can focus on the qubits being allocated in this step by selecting and extracting the relevant slices along the qubit dimension of the tensor, resulting in a new tensor of shape [B, C, 2, h], which contains feature vectors only for these qubits and all cores at this allocation step. For single-qubit allocations…

**Figure 6.** A scenario where the free qubits’ order of allocation influences final cost. (a) Partial allocation diagram for hardware with two cores, with three and one qubit respectively, before assigning a core to logical qubit 2, first time slice only. (b) Circuit used in the example. From the surrounding text: … allocation pool. The process is repeated until no qubits remain to be allocated. Note, however, that paired qubits (those that belon…

**Figure 7.** Qubit allocation process. (a) Paired qubits are allocated together and before free qubits. The policy is applied to pairs {q0, q1} and {q2, q3} and, after sampling, the pair {q2, q3} is assigned to the middle core. (b) After pair {q2, q3} is taken out of the allocation pool, the policy is run again on the remaining pairs. (c) Once there are no more paired qubits left, the same process is repeated with the …

**Figure 8.** Normalized allocation cost on the validation set of circuits during training. From the surrounding text: In Sec. 4.4, we noted that during training, the policy is allowed to take illegal actions but is penalized for doing so. If the invalid movement penalty is not set high enough, the policy could collapse into taking mostly illegal actions, as the performance gains from them outweigh the loss penalty …

**Figure 9.** Ratio of valid moves selected by the policy during training. Data shown with a Gaussian filter (σ = 10) to smooth noise. From the surrounding text: … generated as described in Sec. 4.4, with 50 time slices (whose costs are averaged under the category “Random Avg”) and 7 relevant circuits in the field of quantum computing. The Cuccaro Adder is a popular circuit for adding quantum registers. It is highly efficient and requires few aux…
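
The allocation process outlined in the captions of Figures 6 and 7 (paired qubits first, then free qubits, with each chosen item removed from the pool before the policy runs again) can be sketched as a loop. This is a rough illustration under stated assumptions, not the paper's implementation: `allocate_slice` and `uniform_scores` are hypothetical names, the policy is replaced by a uniform scoring stub, the item and core are sampled jointly, and full cores are masked out, whereas the paper states that during training illegal actions are allowed but penalized.

```python
import numpy as np

def allocate_slice(pairs, free_qubits, capacity, score_fn, rng):
    """Assign every qubit of one time slice to a core.

    pairs:       list of (qa, qb) tuples that share a gate in this slice
    free_qubits: list of qubits not involved in any two-qubit gate
    capacity:    free slots per core (copied, the caller's list is not mutated)
    score_fn:    stand-in for the policy (assumption); maps (items, capacity)
                 to a non-negative [len(items), num_cores] score matrix
    """
    capacity = np.array(capacity, dtype=int)
    assignment = {}
    # Paired qubits go first and need two slots; free qubits need one.
    for pool, size in ((list(pairs), 2), (list(free_qubits), 1)):
        while pool:
            scores = np.asarray(score_fn(pool, capacity), dtype=float)
            scores = scores * (capacity >= size)      # mask cores that are too full
            if scores.sum() == 0:
                raise ValueError("no core can host the remaining items")
            probs = (scores / scores.sum()).ravel()
            choice = rng.choice(probs.size, p=probs)  # sample (item, core) jointly
            item_idx, core = divmod(choice, capacity.size)
            item = pool.pop(item_idx)                 # take the item out of the pool
            for q in (item if size == 2 else (item,)):
                assignment[q] = int(core)
            capacity[core] -= size
    return assignment

def uniform_scores(items, capacity):
    """Placeholder policy: every (item, core) combination is equally likely."""
    return np.ones((len(items), len(capacity)))

rng = np.random.default_rng(0)
result = allocate_slice([(0, 1), (2, 3)], [4, 5], [2, 3, 1], uniform_scores, rng)
print(result)
```

Letting the sampled action pick which item to place next, rather than fixing an order in advance, reflects the point of Figure 6 that the order in which free qubits are allocated can change the final cost.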

Original abstract

The scaling of quantum processors is currently limited by technical challenges such as decoherence and cross-talk. As the number of qubits grows, interference increases the computational noise. Distributed quantum computing addresses these limitations by interconnecting smaller, easier-to-handle quantum processors (cores), but it introduces the challenge of minimizing slow, error-prone inter-core communication. The task of distributing quantum circuits across cores while minimizing communication costs is known as the Qubit Allocation problem. This work focuses on developing a deep learning approach to this problem, emphasizing flexibility to quantum hardware topology and improving state-of-the-art performance. Heuristic and non-learning algorithms, such as the Hungarian Qubit Allocation (HQA), currently represent the state of the art. Reinforcement Learning (RL) approaches leverage learned allocation policies but often lack flexibility, requiring retraining when hardware configurations change, and they fall short of the solution quality achieved by non-learning methods. However, learning mechanisms could outperform human-crafted heuristics. To overcome these limitations, this work proposes a flexible, transformer-based architecture that can handle arbitrary numbers of qubits and cores without retraining. Results show that the trained policy consistently outperforms the previous RL state of the art and narrows the gap between RL and HQA for the most common circuits. It achieves a 33% reduction in allocation cost relative to the HQA for the Cuccaro Adder and 25% on average for random circuits. These findings show that learning-based approaches can effectively match the performance of hand-crafted heuristics, a crucial step towards their application in real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The size-agnostic transformer RL allocator is the actual new piece, but its performance edge depends on unshown zero-shot transfer to unseen sizes and topologies.

The core contribution is a transformer RL policy for qubit allocation that is meant to work across different numbers of qubits and cores without retraining. Earlier RL methods needed fresh training for each hardware change, so removing that step is a clear step past the cited baselines.

The paper does a reasonable job framing the practical issue in distributed quantum setups and showing that a learned policy can beat prior RL allocators while closing some of the gap to the Hungarian heuristic. The reported 33% cost drop on the Cuccaro Adder and 25% average on random circuits are the concrete numbers offered.

The main soft spot is the generalization claim itself. The abstract states the model handles arbitrary sizes and topologies after one training run, yet the stress-test note correctly flags that no evidence is given for transfer to hardware graphs or core counts materially outside the training distribution. If the state representation or masking still carries training-specific structure, the numerical gains cannot be credited to the size-agnostic design. Without the full experimental sections, circuit definitions, and transfer tests, it is impossible to judge whether the improvements are robust or tied to the specific cases shown.

This is for readers working on RL for quantum mapping or distributed architectures. Someone already following the HQA and prior RL lines would find the architecture idea worth examining. It is coherent enough on its own terms to deserve referee time, though the generalization evidence will need to be strengthened.

Referee Report

2 major / 1 minor

Summary. The paper proposes SQARL, a transformer-based reinforcement learning architecture for the qubit allocation problem in distributed quantum computing. It claims the method is size-agnostic (handling arbitrary qubit/core counts and topologies without retraining), outperforms prior RL baselines, and narrows the gap to the Hungarian Qubit Allocation (HQA) heuristic, with reported gains of 33% cost reduction versus HQA on the Cuccaro Adder and 25% average on random circuits.

Significance. If the size-agnostic generalization claim holds with verifiable zero-shot transfer, the result would be significant for practical RL deployment in quantum hardware, as it removes the retraining requirement that currently limits learned policies relative to topology-independent heuristics. The empirical narrowing of the RL-HQA gap, if robustly demonstrated, would credit the transformer encoder for enabling competitive learned allocation policies.

major comments (2)

[Abstract] The central claim that the architecture 'can handle arbitrary numbers of qubits and cores without retraining' is load-bearing for the reported performance gains, yet the provided text contains no experimental section, table, or figure demonstrating zero-shot transfer to topologies or sizes materially outside the training distribution (e.g., 2-D to 3-D grids or 4-core to 16-core configurations). Without such evidence, the 33% and 25% improvements cannot be attributed to the claimed flexibility.
[Abstract, performance claims] The headline results (33% reduction vs. HQA on Cuccaro Adder; 25% average on random circuits; outperforming prior RL) are presented without reference to training/test topology distributions, statistical tests, or ablations isolating the transformer attention mechanism, all of which are needed to confirm that the gains stem from size-agnostic properties rather than in-distribution fitting.

minor comments (1)

[Abstract] The RL algorithm (e.g., policy gradient variant or value function) and the state representation are not described; including them would aid clarity even in a high-level summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and valuable feedback on our manuscript. We appreciate the emphasis on the importance of demonstrating the size-agnostic properties and the need for clearer experimental context in the abstract. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to address them.

Referee: [Abstract] The central claim that the architecture 'can handle arbitrary numbers of qubits and cores without retraining' is load-bearing for the reported performance gains, yet the provided text contains no experimental section, table, or figure demonstrating zero-shot transfer to topologies or sizes materially outside the training distribution (e.g., 2-D to 3-D grids or 4-core to 16-core configurations). Without such evidence, the 33% and 25% improvements cannot be attributed to the claimed flexibility.

Authors: We agree that the abstract, being a concise summary, does not include direct references or citations to the supporting experiments. The manuscript body includes experimental results that evaluate the trained policy on varying numbers of qubits and cores, as well as different topologies, without retraining, including instances outside the training distribution. We will revise the abstract to explicitly reference this experimental validation of zero-shot generalization, for example by adding a phrase such as 'supported by zero-shot evaluations on unseen sizes and topologies.' This will strengthen the link between the flexibility claim and the reported performance gains. Revision: yes

Referee: [Abstract, performance claims] The headline results (33% reduction vs. HQA on Cuccaro Adder; 25% average on random circuits; outperforming prior RL) are presented without reference to training/test topology distributions, statistical tests, or ablations isolating the transformer attention mechanism, all of which are needed to confirm that the gains stem from size-agnostic properties rather than in-distribution fitting.

Authors: We acknowledge that the abstract would benefit from additional context on the experimental conditions to better support attribution of the gains. The manuscript details the training and evaluation distributions, reports aggregated results with variability measures, and presents ablations on architectural components including the transformer. We will revise the abstract to briefly indicate the relevant setup, such as noting evaluation across diverse circuit sizes and topologies. This will help clarify that the improvements relate to the size-agnostic design. Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on external baselines.

full rationale

The paper reports trained RL policy performance on allocation cost for specific circuits (Cuccaro Adder, random circuits) against prior RL methods and the external HQA heuristic. No equations, self-citations, or fitted parameters are presented as load-bearing derivations that reduce to the method's own inputs by construction. The size-agnostic claim is an architectural assertion evaluated empirically rather than a self-referential definition or renamed known result. This is the common case of a self-contained empirical ML paper with independent external comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.
