
arxiv: 2605.22463 · v1 · pith:BLKKWE25new · submitted 2026-05-21 · 🪐 quant-ph · cs.LG

Reinforcement learning for ion shuttling on trapped-ion quantum computers

Pith reviewed 2026-05-22 05:37 UTC · model grok-4.3

keywords reinforcement learning · ion shuttling · trapped-ion quantum computing · modular architectures · quantum hardware optimization · quantum control · scalable quantum computing

The pith

Reinforcement learning optimizes ion shuttling on trapped-ion quantum computers and cuts operations by up to 36.3 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Trapped-ion quantum computers rely on modular chips that divide tasks into separate zones for storage, preparation, and gates. Moving ions between these zones is called shuttling, and the number of required moves grows rapidly with more ions, turning it into a hard optimization task. The paper shows that a reinforcement learning agent can discover better shuttling strategies by trial and error inside a simulation of the device. This learned policy reduces the total number of shuttling steps by as much as 36.3 percent compared with standard heuristic rules. The same method works on several different chip layouts, giving hardware designers a practical way to check shuttling performance early in the design process.

Core claim

We demonstrate the first use of reinforcement learning for optimizing ion shuttling. The RL agent learns a shuttling policy through direct interaction with a simulation of the modular trapped-ion architecture. This policy outperforms existing heuristic techniques and reduces the number of shuttling operations by up to 36.3 percent. The approach applies readily to multiple chip architectures and supplies a tool for evaluating shuttling efficiency while designing future, more complex hardware.
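The headline figure is a relative saving. As a minimal sketch, assuming the percentage is measured against the heuristic baseline's operation count (the function name and the counts below are hypothetical and chosen only to reproduce the ratio, they are not data from the paper):

```python
def relative_reduction(baseline_ops: int, optimized_ops: int) -> float:
    """Fraction of shuttling operations saved relative to a baseline count."""
    if baseline_ops <= 0:
        raise ValueError("baseline_ops must be positive")
    return (baseline_ops - optimized_ops) / baseline_ops


# Hypothetical counts, not taken from the paper: 1000 operations for the
# heuristic versus 637 for the learned policy.
print(f"{relative_reduction(1000, 637):.1%}")  # prints 36.3%
```

Under this reading, a 36.3 percent reduction means the learned policy needs a little under two thirds of the baseline's shuttling operations on the best-case instance.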

What carries the argument

Reinforcement learning agent that learns a policy for choosing ion transport steps to minimize total shuttling operations in a simulated modular chip.
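To make the agent-simulator loop concrete, here is a toy sketch. It is not the paper's simulator: the class `LineChipEnv`, its two-ion line layout, the two-cell "compute zone", and the constant cost of -1 per operation are all assumptions made for illustration. A random policy stands in for the learned one.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[int, ...]
Action = Tuple[int, int]  # (ion index, direction -1 or +1)


class LineChipEnv:
    """Toy stand-in for a shuttling simulator (hypothetical layout)."""

    def __init__(self, n_cells: int = 6, seed: int = 0):
        self.n_cells = n_cells
        self.rng = random.Random(seed)
        self.positions: List[int] = []

    def reset(self) -> State:
        # Place two ions on distinct random cells of the line.
        self.positions = self.rng.sample(range(self.n_cells), 2)
        return tuple(self.positions)

    def done(self) -> bool:
        # Assumed goal: both ions sit in the compute zone, cells 0 and 1.
        return sorted(self.positions) == [0, 1]

    def step(self, action: Action) -> Tuple[State, float, bool]:
        ion, direction = action
        target = self.positions[ion] + direction
        other = self.positions[1 - ion]
        if 0 <= target < self.n_cells and target != other:
            self.positions[ion] = target  # illegal moves are no-ops
        return tuple(self.positions), -1.0, self.done()  # -1 per operation


def rollout(env: LineChipEnv, policy: Callable[[State], Action],
            max_steps: int = 10_000) -> float:
    """Run one episode and return the accumulated (negative) reward."""
    state, total = env.reset(), 0.0
    for _ in range(max_steps):
        if env.done():
            break
        state, reward, _ = env.step(policy(state))
        total += reward
    return total


def random_policy(rng: random.Random) -> Callable[[State], Action]:
    return lambda state: (rng.randrange(2), rng.choice((-1, 1)))


print(rollout(LineChipEnv(seed=1), random_policy(random.Random(2))))
```

An RL algorithm would replace `random_policy` with a trainable policy and use the returns from many such rollouts to improve it; the less negative the return, the fewer shuttling operations were spent.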

If this is right

  • Fewer shuttling steps lower the chance of errors during transport, supporting more reliable quantum circuits.
  • The method scales to larger ion numbers where manual or heuristic planning becomes impractical.
  • Designers can test proposed chip layouts for shuttling cost before fabrication.
  • The same RL framework can be reused across different zone arrangements with little extra tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the policy transfers well to hardware, it could shorten the engineering cycle for scaling modular ion traps.
  • Analogous RL techniques might later optimize other real-time control tasks such as gate tuning or error correction scheduling.
  • Combining the shuttling optimizer with full-circuit simulators would let researchers measure end-to-end speedups on larger algorithms.

Load-bearing premise

The simulation used for training must capture the main physical constraints and noise so that the learned policy works on real hardware without retraining.

What would settle it

Deploy the trained RL policy on a physical trapped-ion processor and count whether it performs fewer shuttling operations than the best current heuristic method on the same circuit.

Figures

Figures reproduced from arXiv: 2605.22463 by Bodo Rosenhahn, Christian Staufenbiel, Daniel Borcherding, Lea Richtmann, Maximilian Schier, Michèle Heurs, Tobias Schmale.

Figure 1. Layout of the CIRQLE ion trap chip, also called ...
Figure 2. Proposed representation. The top left shows the chip state. A chip-specific adapter translates the chip state into a ...
Figure 3. Comparison of ion shuttling durations using trajecto...
Figure 4. Comparison of ion shuttling duration of our pro...
Figure 5. Ion shuttling duration for different architectures op...
Figure 6. Detailed study on the influence of numeric encodings ...
Original abstract

Scalable trapped-ion quantum computing is commonly realized with modular chips that feature distinct zones with specific functionalities, such as storage, state preparation, and gate execution. To execute a quantum circuit, the ions must be transported between these zones. This process is called ion shuttling. To achieve reliable computation results, the shuttling process must be optimized. However, as the number of ions increases, this becomes a high-dimensional optimization problem where optimal solutions cannot be computed efficiently. We demonstrate, to the best of our knowledge, the first use of reinforcement learning (RL) for the optimization of ion shuttling. RL is well-suited for such scenarios, as it enables learning a strategy through direct interaction with the problem. We show that our RL approach outperforms current state-of-the-art heuristic techniques, yielding a reduction in shuttling operations of up to 36.3 %. Furthermore, we show that our method is easily applicable to various chip architectures. Our approach offers a versatile method to study shuttling efficiency during chip design and, therefore, a highly relevant tool for future, more complex architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a reinforcement learning (RL) method for optimizing ion shuttling in modular trapped-ion quantum computing architectures. It claims to be the first application of RL to this problem and reports that the approach reduces the number of shuttling operations by up to 36.3% relative to existing heuristic techniques while remaining applicable across different chip layouts.

Significance. If the simulator faithfully captures the dominant physical constraints and the learned policies transfer to hardware, the work would supply a practical, scalable tool for studying and improving shuttling efficiency during the design of complex trapped-ion chips. The absence of hardware validation and simulation-fidelity metrics currently limits the strength of this assessment.

major comments (2)
  1. [Abstract] The headline performance claim of a 36.3% reduction in shuttling operations is obtained entirely inside an author-defined simulator. No information is supplied on the motional-heating rates, voltage-noise spectra, inter-zone coupling strengths, or other error channels included in the environment, nor are any closed-loop hardware experiments reported that would close the sim-to-real gap. This directly affects the load-bearing claim that the method is useful for real devices.
  2. [Abstract] The manuscript states that the RL policy outperforms 'current state-of-the-art heuristic techniques' but provides neither the explicit definitions of those heuristics nor quantitative tables comparing operation counts, fidelity, or runtime across multiple ion numbers and architectures. Without these baselines the magnitude of the reported improvement cannot be independently verified.
minor comments (1)
  1. [Abstract] The abstract asserts applicability to 'various chip architectures' but does not indicate whether the same reward function and state representation were used without modification or whether architecture-specific retraining was required.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us clarify the scope and presentation of our work. We provide point-by-point responses to the major comments below, indicating where revisions will be incorporated.

  1. Referee: [Abstract] The headline performance claim of a 36.3% reduction in shuttling operations is obtained entirely inside an author-defined simulator. No information is supplied on the motional-heating rates, voltage-noise spectra, inter-zone coupling strengths, or other error channels included in the environment, nor are any closed-loop hardware experiments reported that would close the sim-to-real gap. This directly affects the load-bearing claim that the method is useful for real devices.

    Authors: We agree that additional details on the simulator would improve transparency. Our environment models the discrete zone assignments and transport operations central to modular trapped-ion architectures, with simplified representations of physical constraints to enable scalable RL training. We will revise the methods section to explicitly list the included assumptions (e.g., idealized transport times and basic heating estimates) and any omitted channels. We acknowledge that the work does not include hardware validation or full error-channel fidelity metrics; as a simulation study demonstrating RL feasibility, we will add a limitations paragraph discussing the sim-to-real gap and suggesting future experimental directions, but we cannot report closed-loop hardware results at this stage. revision: partial

  2. Referee: [Abstract] The manuscript states that the RL policy outperforms 'current state-of-the-art heuristic techniques' but provides neither the explicit definitions of those heuristics nor quantitative tables comparing operation counts, fidelity, or runtime across multiple ion numbers and architectures. Without these baselines the magnitude of the reported improvement cannot be independently verified.

    Authors: We will revise the manuscript to provide explicit definitions of the baseline heuristics, including nearest-zone greedy assignment and shortest-path routing methods drawn from prior trapped-ion literature. We will also add quantitative comparison tables and supplementary figures reporting operation counts, estimated fidelities, and wall-clock runtimes for the RL policy versus these heuristics, evaluated across ion numbers from 4 to 20 and at least three distinct chip layouts. These additions will allow direct verification of the maximum 36.3% reduction in shuttling operations. revision: yes
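The rebuttal names shortest-path routing as a baseline without defining it. A minimal sketch of what such a router could look like, assuming the chip is modeled as an undirected graph of cells and that occupied cells block transit; the X-shaped layout and all cell names below are hypothetical:

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional


def shortest_path(graph: Dict[str, List[str]], start: str, goal: str,
                  blocked: FrozenSet[str] = frozenset()) -> Optional[List[str]]:
    """Breadth-first search for the shortest cell sequence from start to goal."""
    if start == goal:
        return [start]
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt in parents or nxt in blocked:
                continue
            parents[nxt] = node
            if nxt == goal:
                # Walk back through the parents to rebuild the path.
                path = [nxt]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None  # goal unreachable around the blocked cells


# Hypothetical X-shaped layout: junction "J" with four two-cell arms.
graph = {"J": ["a1", "b1", "c1", "d1"]}
for arm in "abcd":
    graph[f"{arm}1"] = ["J", f"{arm}2"]
    graph[f"{arm}2"] = [f"{arm}1"]

print(shortest_path(graph, "a2", "c2"))                       # ['a2', 'a1', 'J', 'c1', 'c2']
print(shortest_path(graph, "a2", "c2", blocked=frozenset({"J"})))  # None
```

A router of this kind plans one ion at a time, so it cannot anticipate when moving a blocking ion first would be cheaper overall. That gap is where a learned policy has room to improve on such a baseline.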

Circularity Check

0 steps flagged

No circularity: empirical RL performance comparison in author-defined simulator


The paper reports an empirical result: an RL policy trained inside a custom simulator achieves up to 36.3% fewer shuttling operations than heuristics. No equations, derivations, or uniqueness theorems are presented whose outputs reduce by construction to the inputs or to self-citations. The performance metric is measured directly against the same simulator used for training; this is a standard empirical benchmark, not a definitional or fitted-input circularity. The sim-to-real transfer gap is a validity concern, not a circularity in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text.



Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 4 internal anchors

  1. [1]

    Example architecture 1: X-chip. Our first example architecture is the QVLS QROSS chip [23], the first proposal for a QCCD developed by Quantum Valley Lower Saxony (QVLS) [16]. It consists of four registers connected by an X-junction; we refer to it as the “X-chip”. The registers include a compute zone that can hold up to 2 ions, a state preparation and mea...

  2. [2]

    Example architecture 2: Q-chip. We also study an alternative chip design, the QVLS CIRQLE chip [24], with a more compact storage register, allowing more ions to fit on the same chip size. Here, the storage zone is consolidated in a ring. The compute zone (capacity 2 ions) and the SPAM zone (capacity 1 ion) are connected to this ring via a junction, resulti...

  3. [3]

    ... store ions that are used together in proximity. [10] also implement proximity sorting and use the readout zone as temporary storage to make ions needed shortly after easily accessible. Handling traffic blocks: Different heuristics are used to prevent an ion from being blocked because the path it has to take is occupied by other ions [13]. This is do...

  4. [4]

    ... propose a probabilistic formula that accounts for several heuristics, such as least movements and handling traffic blocks. With this formula, they can then determine whether an ion should move or stay in the same trap. The mentioned strategies are planning strategies; a shuttling protocol is developed for a given circuit in advance and then executed on the...

  5. [5]

    Reference 1: Heuristic compiler. In the framework of the QVLS X-chip, a shuttling compiler was developed to address the ion shuttling problem using heuristics derived from observations of the chip’s architecture [10]. The challenge of orchestrating ions across the chip to execute a given quantum circuit was therefore decomposed into several phases, two...

  6. [6]

    Reference 2: SAT solver. For benchmarking purposes, it is useful to compare obtained trajectories against optimal ones. While finding these optimal trajectories is likely infeasible in the general case, we can at least study small instances to gain some basic insight. In principle, a naive exhaustive search through all shuttling sequences of a fixed ...
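The excerpt mentions exhaustive search as the naive way to obtain optimal trajectories on small instances. The sketch below illustrates that idea with a breadth-first search over joint ion configurations. It is not the SAT encoding used in the paper, and the six-cell layout, cell names, and goal predicate are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

Config = Tuple[str, ...]  # one cell name per ion


def min_shuttles(graph: Dict[str, List[str]], start: Config,
                 is_goal: Callable[[Config], bool]) -> Optional[int]:
    """Fewest single-ion moves from start to any goal configuration."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, cost = queue.popleft()
        if is_goal(state):
            return cost
        occupied = set(state)
        for ion, cell in enumerate(state):
            for nxt in graph[cell]:
                if nxt in occupied:  # a cell holds at most one ion here
                    continue
                new_state = state[:ion] + (nxt,) + state[ion + 1:]
                if new_state not in seen:
                    seen.add(new_state)
                    queue.append((new_state, cost + 1))
    return None


# Hypothetical chip: compute arm c0-c1, junction J, storage arm s0-s1, spur t0.
graph = {
    "c0": ["c1"], "c1": ["c0", "J"],
    "J": ["c1", "s0", "t0"],
    "s0": ["J", "s1"], "s1": ["s0"],
    "t0": ["J"],
}
start = ("c1", "s0", "s1")  # ion 0 blocks the compute arm
goal = lambda cfg: {cfg[1], cfg[2]} == {"c0", "c1"}  # ions 1 and 2 in compute
print(min_shuttles(graph, start, goal))  # prints 8
```

The number of configurations grows combinatorially with the number of cells and ions, which is consistent with the excerpt's remark that optimal solutions are only practical for small instances.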

  7. [7]

    Requirements for representations. When employing a neural network as a policy π, the state space S must usually be transformed to compatible representations, e.g. vectors, using some transformation. For simplicity, in a slight abuse of notation, we use S as the representation space directly. A good representation of the chip state and circuit to be executed shoul...

  8. [8]

    Proposed representation. The core idea of our proposed representation is abstracting the qubit label and sequence position of a two-qubit gate by encoding a gate through the cell-location of the other operand and the depth of the gate in the dependency graph. This is illustrated in Figure 2 for a lookahead of $k_{\text{lookahead}} = 2$. The following steps are performed:

  9. [9]

    A chip-specific adapter translates the chip state (top left) into a tabular form K (columns “Cell” and “Qubit” on the right). In our case the adapter simply iterates all zones starting with the position next to the junction

  10. [10]

    If the circuit (bottom left) is given as a list of gates, the directed acyclic graph of the circuit is calcu...

    [Figure 2 panel residue: Circuit of Two-Qubit Gates (g1: qubits 1, 3; g2: qubits 2, 4; g3: qubits 1, 5; g4: qubits 1, 3); Directed Circuit Graph (Depth 0, Depth 1, Depth 2); Chip State (Storage, Compute, Spam); Adapter; Encoding M with rows such as (1, 5, 6), (1, 10, ⋄), (0, ⋄, ⋄), (0, ⋄, ⋄), (1, 2, ⋄), (1...]
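One way to obtain gate depths from a gate list is sketched below, assuming that depth means the length of the longest chain of earlier gates sharing a qubit (the function name is hypothetical). The input is the four-gate example that appears in the Figure 2 text, read as g1 on qubits (1, 3), g2 on (2, 4), g3 on (1, 5), and g4 on (1, 3).

```python
from typing import Dict, List, Tuple


def gate_depths(gates: List[Tuple[int, int]]) -> List[int]:
    """Depth of each two-qubit gate in the circuit's dependency graph."""
    next_free: Dict[int, int] = {}  # qubit -> earliest depth for its next gate
    depths: List[int] = []
    for a, b in gates:
        # A gate must wait for the previous gate on either of its operands.
        depth = max(next_free.get(a, 0), next_free.get(b, 0))
        depths.append(depth)
        next_free[a] = next_free[b] = depth + 1
    return depths


print(gate_depths([(1, 3), (2, 4), (1, 5), (1, 3)]))  # [0, 0, 1, 2]
```

The result places g1 and g2 at depth 0, g3 at depth 1, and g4 at depth 2, which agrees with the three depth levels named in the figure text.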

  11. [11]

    The encoding matrix $M$ is computed (right). For each cell, it is encoded whether it is occupied by a qubit. Next, for depths in $\{0, \dots, k_{\text{lookahead}} - 1\}$, it is checked if a gate at that depth exists with the qubit of the current cell. If it exists, the cell of the other operand is encoded. Otherwise, an empty token ⋄ is encoded. Gates at a depth of $k_{\text{lookahead}}$ ...
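The encoding step described in this excerpt can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: cells are indexed from 0 in adapter order, `None` stands in for the empty token ⋄, every gate operand is assumed to be on the chip, and the function name and the example layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

EMPTY = None  # stands in for the empty token


def encode(cells: List[Optional[int]], gates: List[Tuple[int, int]],
           k_lookahead: int) -> List[tuple]:
    """One row per cell: (occupied, partner cell at depth 0, depth 1, ...)."""
    cell_of = {q: i for i, q in enumerate(cells) if q is not None}
    next_free: Dict[int, int] = {}
    partner: Dict[int, Dict[int, int]] = {}  # qubit -> depth -> other operand
    for a, b in gates:
        depth = max(next_free.get(a, 0), next_free.get(b, 0))
        next_free[a] = next_free[b] = depth + 1
        partner.setdefault(a, {})[depth] = b
        partner.setdefault(b, {})[depth] = a
    rows = []
    for qubit in cells:
        if qubit is None:
            rows.append((0,) + (EMPTY,) * k_lookahead)
            continue
        by_depth = partner.get(qubit, {})
        rows.append((1,) + tuple(
            cell_of[by_depth[d]] if d in by_depth else EMPTY
            for d in range(k_lookahead)
        ))
    return rows


# Hypothetical chip state: cell index -> qubit label (None = empty cell).
cells = [1, 2, None, 3, 4, 5]
gates = [(1, 3), (2, 4), (1, 5), (1, 3)]
for row in encode(cells, gates, k_lookahead=2):
    print(row)
```

Because each row refers to the partner's cell rather than to qubit labels or gate positions, two situations that differ only in how qubits are named produce the same encoding, which is the abstraction the excerpt describes.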

  12. [12]

    Shaped reward. The basic reward signal for a goal-reaching problem is very sparse, as the agent receives a negative reward at a constant rate $c_r$ until a goal state is reached. If the problem only terminates upon reaching a goal state and the agent has not encountered any goal states yet, the value of every state must be estimated as $V = -c_r \int_0^\infty e^{-\beta t}\,\mathrm{d}t$ = − c...
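The integral in this excerpt has a closed form. As a short worked step, assuming $\beta > 0$ plays the role of a continuous-time discount rate (the excerpt is cut off before it states the result):

```latex
V = -c_r \int_0^\infty e^{-\beta t}\,\mathrm{d}t
  = -c_r \left[ -\frac{1}{\beta}\, e^{-\beta t} \right]_0^\infty
  = -\frac{c_r}{\beta}, \qquad \beta > 0 .
```

Every state then shares the same value estimate until a goal is first reached, so the unshaped signal gives the agent no gradient toward the goal. This is the sparsity problem the excerpt sets out to address.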

  13. [13]

    Problem generation during training. When training the RL agent, a diverse range of starting states is desirable, such that the entire possible problem space is well covered. A starting state is generated by first drawing the number of ions or qubits on the chip: $z \sim \mathrm{Uniform}(\{2, \dots, n_{\max}\})$. Here, $n_{\max}$ is the maximum number of ions supported. The qubits a...
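A sketch of this sampling step follows. Only the draw of $z$ comes from the excerpt; the text is cut off before it says how qubits are placed or how circuits are drawn, so the uniform placement on distinct cells and the random two-qubit gates below are assumptions, as are the function and parameter names.

```python
import random
from typing import List, Optional, Tuple


def sample_start_state(n_max: int, n_cells: int, n_gates: int,
                       rng: random.Random
                       ) -> Tuple[List[Optional[int]], List[Tuple[int, int]]]:
    """Draw a random chip occupancy and a random two-qubit circuit."""
    if not 2 <= n_max <= n_cells:
        raise ValueError("need 2 <= n_max <= n_cells")
    z = rng.randint(2, n_max)  # z ~ Uniform({2, ..., n_max}), ends inclusive
    # ASSUMPTION: qubits 1..z are placed on distinct cells uniformly at random.
    cells: List[Optional[int]] = [None] * n_cells
    for qubit, cell in enumerate(rng.sample(range(n_cells), z), start=1):
        cells[cell] = qubit
    # ASSUMPTION: each gate acts on two distinct qubits chosen at random.
    gates = [tuple(rng.sample(range(1, z + 1), 2)) for _ in range(n_gates)]
    return cells, gates


cells, gates = sample_start_state(n_max=5, n_cells=8, n_gates=4,
                                  rng=random.Random(0))
print(cells, gates)
```

Drawing the ion count first means that small and large instances appear equally often during training, rather than the sampler concentrating on the largest supported chip load.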

  14. [14]

    C. D. Bruzewicz, J. Chiaverini, R. McConnell, and J. M. Sage, Trapped-ion quantum computing: Progress and challenges, Applied Physics Reviews 6, 021314 (2019)

  15. [15]

    J. I. Cirac and P. Zoller, Quantum computations with cold trapped ions, Physical Review Letters 74, 4091 (1995)

  16. [16]

    A. Sørensen and K. Mølmer, Quantum Computation with Ions in Thermal Motion, Physical Review Letters 82, 1971 (1999)

  17. [17]

    G. Zarantonello, H. Hahn, J. Morgner, M. Schulte, A. Bautista-Salvador, R. F. Werner, K. Hammerer, and C. Ospelkaus, Robust and Resource-Efficient Microwave Near-Field Entangling ⁹Be⁺ Gate, Physical Review Letters 123, 260503 (2019)

  18. [18]

    D. Kielpinski, C. Monroe, and D. J. Wineland, Architecture for a large-scale ion-trap quantum computer, Nature 417, 709 (2002)

  19. [19]

    J. M. Pino, J. M. Dreiling, C. Figgatt, J. P. Gaebler, S. A. Moses, M. S. Allman, C. H. Baldwin, M. Foss-Feig, D. Hayes, K. Mayer, C. Ryan-Anderson, and B. Neyenhuis, Demonstration of the trapped-ion quantum CCD computer architecture, Nature 592, 209 (2021)

  20. [20]

    S. A. Moses, C. H. Baldwin, M. S. Allman, R. Ancona, L. Ascarrunz, C. Barnes, J. Bartolotta, B. Bjork, P. Blanchard, M. Bohn, J. G. Bohnet, N. C. Brown, N. Q. Burdick, W. C. Burton, S. L. Campbell, J. P. Campora, C. Carron, J. Chambers, J. W. Chan, Y. H. Chen, A. Chernoguzov, E. Chertkov, J. Colina, J. P. Curtis, R. Daniel, M. DeCross, D. Deen, C. Delan...

  21. [21]

    J. Durandau, J. Wagner, F. Mailhot, C.-A. Brunet, F. Schmidt-Kaler, U. Poschinger, and Y. Bérubé-Lauzière, Automated Generation of Shuttling Sequences for a Linear Segmented Ion Trap Quantum Computer, Quantum 7, 1175 (2023)

  22. [22]

    Helios: A 98-qubit trapped-ion quantum computer

    A. Ransford, M. S. Allman, J. Arkinstall, J. P. Campora, S. F. Cooper, R. D. Delaney, J. M. Dreiling, B. Estey, C. Figgatt, A. Hall, A. A. Husain, A. Isanaka, C. J. Kennedy, N. Kotibhaskar, I. S. Madjarov, K. Mayer, A. R. Milne, A. J. Park, A. P. Reed, R. Ancona, M. P. Andersen, P. Andres-Martinez, W. Angenent, L. Argueta, B. Arkin, L. Ascarrunz, W. Bak...

  23. [23]

    T. Schmale, B. Temesi, A. Baishya, N. Pulido-Mateo, L. Krinner, T. Dubielzig, C. Ospelkaus, H. Weimer, and D. Borcherding, Backend compiler phases for trapped-ion quantum computers, in 2022 IEEE International Conference on Quantum Software (QSW) (2022) pp. 32–37

  24. [24]

    A. A. Saki, R. O. Topaloglu, and S. Ghosh, Muzzle the Shuttle: Efficient Compilation for Multi-Trap Trapped-Ion Quantum Computers, in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2022) pp. 322–327

  25. [25]

    P. Murali, D. M. Debroy, K. R. Brown, and M. Martonosi, Architecting Noisy Intermediate-Scale Trapped Ion Quantum Computers, in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) (2020) pp. 529–542

  26. [26]

    X. Wu, C. Zhu, J. Wang, and X. Wang, MUSS-TI: Multi-level Shuttle Scheduling for Large-Scale Entanglement Module Linked Trapped-Ion (2025), arXiv:2509.25988 [quant-ph]

  27. [27]

    D. Schoenberger, S. Hillmich, M. Brandl, and R. Wille, Shuttling for Scalable Trapped-Ion Quantum Computers, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 44, 2144 (2025)

  28. [28]

    W. Dai, K. A. Brown, and T. G. Robertazzi, Advanced Shuttle Strategies for Parallel QCCD Architectures, IEEE Transactions on Quantum Engineering 5, 1 (2024)

  29. [29]

    Quantum Valley Lower Saxony e. V., https://qvls.de/en/ (2026), accessed: 2026-05-20

  30. [30]

    R. Wille, L. Berent, T. Forster, J. Kunasaikaran, K. Mato, T. Peham, N. Quetschlich, D. Rovara, A. Sander, L. Schmid, D. Schoenberger, Y. Stade, and L. Burgholzer, The MQT handbook: A summary of design automation tools and software for quantum computing, in IEEE International Conference on Quantum Software (QSW) (2024), arXiv:2405.17543

  31. [31]

    A. W. Cross, L. S. Bishop, S. Sheldon, P. D. Nation, and J. M. Gambetta, Validating quantum computers using randomized model circuits, Physical Review A 100, 032328 (2019)

  32. [32]

    Quantum Valley Lower Saxony Q1 - Forschung, www.qvls-q1.de/forschung (2026), accessed: 2026-05-20

  33. [33]

    D. Schoenberger, S. Hillmich, M. Brandl, and R. Wille, Using Boolean Satisfiability for Exact Shuttling in Trapped-Ion Quantum Computers, in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC) (2024) pp. 127–133

  34. [34]

    R. B. Blakestad, Transport of Trapped-Ion Qubits within a Scalable Quantum Processor, Ph.D. thesis, University of Colorado (2010)

  35. [35]

    On the Transport of Atomic Ions in Linear and Multidimensional Ion Trap Arrays

    D. Hucul, M. Yeo, W. K. Hensinger, J. Rabchuk, S. Olmschenk, and C. Monroe, On the Transport of Atomic Ions in Linear and Multidimensional Ion Trap Arrays (2008), arXiv:quant-ph/0702175

  36. [36]

    F. Ungerechts, R. Munoz, A. Hoffmann, J. Bätge, M. M. Billah, T. Meiners, B. Kaune, G. Zarantonello, and C. Ospelkaus, Designing a Trapped-Ion Quantum Processor based on Near-Field Microwave Quantum Logic Gates (2026), to be published

  37. [37]

    F. Ungerechts, J. Bätge, M. M. Billah, L. Krieger, R. Munoz, P. Nuschke, A. Hoffmann, G. Zarantonello, and C. Ospelkaus, CIRQLE: A Comprehensive Register-Based Trapped-Ion Quantum Processor with Near-Field Microwave Control (2026), to be published

  38. [38]

    R. Bowler, J. Gaebler, Y. Lin, T. R. Tan, D. Hanneke, J. D. Jost, J. P. Home, D. Leibfried, and D. J. Wineland, Coherent Diabatic Ion Transport and Separation in a Multizone Trap Array, Physical Review Letters 109, 080502 (2012)

  39. [39]

    A. Walther, F. Ziesel, T. Ruster, S. T. Dawkins, K. Ott, M. Hettrich, K. Singer, F. Schmidt-Kaler, and U. Poschinger, Controlling Fast Transport of Cold Trapped Ions, Physical Review Letters 109, 080501 (2012)

  40. [40]

    X.-J. Lu, A. Ruschhaupt, and J. G. Muga, Fast shuttling of a particle under weak spring-constant noise of the moving trap, Physical Review A 97, 053402 (2018)

  41. [41]

    V. Kaushal, B. Lekitsch, A. Stahl, J. Hilder, D. Pijn, C. Schmiegelow, A. Bermudez, M. Müller, F. Schmidt-Kaler, and U. Poschinger, Shuttling-based trapped-ion quantum information processing, AVS Quantum Science 2, 014101 (2020)

  42. [42]

    D. Schoenberger and R. Wille, Orchestrating Multi-Zone Shuttling in Trapped-Ion Quantum Computers, in 2025 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 01 (2025) pp. 1069–1075

  43. [43]

    D. Schoenberger, J. Hilder, F. Schmidt-Kaler, and R. Wille, Shuttling for Trapped-Ion Quantum Computers with Embedded Processing Zones, in 2025 IEEE International Conference on Quantum Software (QSW) (2025) pp. 123–129

  44. [44]

    T. Schmale, Hybrid quantum-classical computation – from infrastructure to algorithms, Institutionelles Repositorium der Leibniz Universität Hannover 10.15488/20338 (2026)

  45. [45]

    M. Sipser, Introduction to the theory of computation, ACM Sigact News 27, 27 (1996)

  46. [46]

    M. Bukov and F. Marquardt, Reinforcement Learning for Quantum Technology (2026), arXiv:2601.18953 [quant-ph]

  47. [47]

    M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming (John Wiley & Sons, 2014)

  48. [48]

    R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, second edition, edited by F. Bach (MIT Press, 2018)

  49. [49]

    R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Advances in neural information processing systems 12 (1999)

  50. [50]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017)

  51. [51]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, High-dimensional continuous control using generalized advantage estimation, arXiv preprint arXiv:1506.02438 (2015)

  52. [52]

    R. Givan, T. Dean, and M. Greig, Equivalence notions and model minimization in Markov decision processes, Artificial intelligence 147, 163 (2003)

  53. [53]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017)

  54. [54]

    L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al., Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures, in International conference on machine learning (PMLR, 2018) pp. 1407–1416

  55. [55]

    H. Lee, D. Hwang, D. Kim, H. Kim, J. J. Tai, K. Subramanian, P. R. Wurman, J. Choo, P. Stone, and T. Seno, Simba: Simplicity bias for scaling up parameters in deep reinforcement learning, in 13th International Conference on Learning Representations, ICLR 2025 (International Conference on Learning Representations, ICLR, 2025) pp. 50050–50082

  56. [56]

    A. Y. Ng, D. Harada, and S. Russell, Policy invariance under reward transformations: Theory and application to reward shaping, in ICML, Vol. 99 (1999) pp. 278–287

  57. [57]

    See Supplemental Material at [URL will be inserted by publisher] for all results on MQT circuits and animations of some shuttling sequences

  58. [58]

    L. De Moura and N. Bjørner, Z3: An efficient SMT solver, in International conference on Tools and Algorithms for the Construction and Analysis of Systems (Springer, 2008) pp. 337–340.

    Supplementary Material: Reinforcement learning for ion shuttling on trapped-ion quantum computers. Maximilian Schier ∗ and Bodo Rosenhahn, Institute for Information Processin...