pith · machine review for the scientific record

arxiv: 2605.03434 · v1 · submitted 2026-05-05 · 💻 cs.LG · quant-ph

Recognition: unknown

Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 17:16 UTC · model grok-4.3

classification 💻 cs.LG quant-ph
keywords quantum reinforcement learning · hierarchical reinforcement learning · variational quantum circuits · option-critic architecture · hybrid quantum-classical models · parameter efficiency · feature extraction

The pith

A hybrid agent using a quantum feature extractor outperforms classical hierarchical reinforcement learning baselines while using up to 66% fewer trainable parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a hybrid hierarchical reinforcement learning agent based on the option-critic architecture in which variational quantum circuits replace classical components for feature extraction, option-value functions, termination functions, and intra-option policies. Experiments on standard benchmark environments show that the configuration with a quantum feature extractor exceeds classical performance and reduces the number of trainable parameters by as much as 66%. The work also identifies that quantum circuits for option-value estimation create a performance bottleneck. Ablation studies further demonstrate that specific architectural choices within the quantum circuits influence overall results. These outcomes supply concrete design guidelines for building parameter-efficient hybrid agents in hierarchical decision-making settings.

Core claim

The central claim is that a hybrid hierarchical agent that substitutes classical components with variational quantum circuits for feature extraction, option-value estimation, termination, and intra-option policies can outperform classical baselines on standard environments, with the quantum-feature-extractor variant saving up to 66% of trainable parameters. Quantum option-value estimation severely degrades performance, while architectural choices within the quantum circuits affect overall effectiveness. The results establish practical design principles for parameter-efficient hybrid hierarchical agents.

What carries the argument

The hybrid option-critic architecture that substitutes variational quantum circuits for feature extractors, option-value functions, termination functions, and intra-option policies.
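The substitution scheme can be summarized schematically. The variant names below (Hybrid F, Hybrid O, Hybrid P) follow the paper's figures; the component mapping is an illustrative sketch, not the authors' code:

```python
# Sketch of which option-critic components each hybrid variant replaces
# with a variational quantum circuit (VQC). Assumes the variant naming
# from the paper's figures; details are illustrative.
COMPONENTS = ("feature_extractor", "option_value", "termination", "intra_option_policy")

HYBRID_VARIANTS = {
    "Hybrid F": {"feature_extractor"},    # quantum feature extractor (best-performing)
    "Hybrid O": {"option_value"},         # quantum option-value (reported bottleneck)
    "Hybrid P": {"intra_option_policy"},  # quantum intra-option policies
}

def describe(variant):
    """Map each component to its implementation under a given variant."""
    quantum = HYBRID_VARIANTS[variant]
    return {c: ("VQC" if c in quantum else "classical MLP") for c in COMPONENTS}

print(describe("Hybrid F"))
```

Under this scheme, only the component sets differ between variants; the option-critic control flow itself stays fixed.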

If this is right

  • A quantum feature extractor improves performance over fully classical hierarchical agents while cutting trainable parameters by up to 66%.
  • Quantum circuits should be avoided for option-value estimation because they create a severe performance bottleneck.
  • Architectural details of the variational quantum circuits directly determine how well the hybrid agent performs.
  • Hybrid quantum-classical designs offer a practical route to more parameter-efficient hierarchical reinforcement learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If simulator results translate to real hardware, the approach could extend hierarchical RL to larger state spaces where parameter budgets are tight.
  • The identified bottleneck implies that future quantum RL work should prioritize quantum components for feature extraction over value estimation.
  • Similar hybrid substitutions might improve efficiency in other temporal-abstraction RL methods beyond the option-critic framework.

Load-bearing premise

That variational quantum circuits trained and evaluated on classical simulators faithfully represent their behavior on actual near-term quantum hardware.

What would settle it

Executing the full hybrid agent on a physical quantum processor and checking whether the reported performance gains and parameter reductions still hold.
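Short of a hardware run, a crude first check is how hardware-style noise would attenuate the circuit outputs the agent consumes. Under a simplified global depolarizing model (an assumption for illustration, not the paper's analysis), each noisy layer shrinks every Pauli expectation toward zero by a factor (1 − p):

```python
def noisy_expectation(z_ideal, p_depol, n_layers):
    """Simplified global depolarizing model: each of the n_layers circuit
    layers scales Pauli expectations by (1 - p_depol), so the features
    read out from a VQC shrink toward zero as depth or noise grows."""
    return (1 - p_depol) ** n_layers * z_ideal

# With 2% depolarizing noise per layer, a 5-layer circuit retains ~90%
# of an ideal expectation value; a 50-layer circuit retains ~36%.
print(noisy_expectation(1.0, 0.02, 5))
print(noisy_expectation(1.0, 0.02, 50))
```

The shallow circuits used for feature extraction would be far less affected than deep ones, which is one reason the parameter-efficiency claim might plausibly survive a hardware test.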

Figures

Figures reproduced from arXiv: 2605.03434 by Fu-Chieh Chang, Samuel Yen-Chi Chen, Yu-Ting Lee.

Figure 1: Training a hybrid RL agent with quantum circuits. We consider a hybrid framework in which the RL agent is enhanced by quantum circuits trained with a deep RL algorithm. In this work, we consider a hybrid quantum-classical option-critic architecture.

Figure 2: Model architecture for the hybrid option-critic agent. The input observations are processed by a shared feature extractor and then passed to each downstream component. The Option-Value and Termination Functions are implemented as single networks that output logits for all |Ω| options simultaneously. Conversely, the Intra-Option Policies are |Ω| independent networks.

Figure 3: General structure of a variational quantum circuit (VQC). A classical state s is encoded into a quantum state via the encoding unitary U(s), evolved by the variational circuit V(θ), and then measured by evaluating expectation values of Hermitian observables.

Figure 4: VQC structure with 4 qubits and 1 layer. The unitaries in the dashed block constitute a single layer. This block is repeated n times based on the environment and the VQC's position in the model architecture.

Figure 5: The two tested environments: CartPole and Acrobot.

Figure 6: Performance of option-critic with different model architectures. We report the learning curves in CartPole and Acrobot. Curves represent mean episodic reward, and the shaded area represents the standard deviation. Each curve is averaged with a trailing window of 2000 steps and averaged over 10 runs with different random seeds.

Figure 7: Comparison with scaled classical baselines. We scale the classical feature extractor to 16, 24, and 32 hidden neurons while keeping all other components fixed. Hybrid F exceeds the 24-neuron classical baselines in both environments (Table III) while saving 66% and 52% trainable parameters in CartPole and Acrobot.

Figure 8: Analysis of the option-value bottleneck. We compare policy entropy, actor loss, and critic loss between Hybrid O and the classical baselines in both environments. Hybrid O shows high policy entropy (close to the theoretical maximum), flat near-zero critic loss, and flat actor loss throughout training. In contrast, classical baselines show active learning dynamics.

Figure 9: Impact of the number of options on performance. We investigate how classical and quantum intra-option policies (Hybrid P) scale with increased temporal abstraction. Results suggest no consistent pattern of improvement or degradation.
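The VQC structure in Figures 3 and 4 — encoding unitary U(s), variational block V(θ), Pauli-Z readout — can be sketched as a small state-vector simulation. The paper's exact gate set is not specified here, so this assumes RY encoding, RY variational rotations, and a ring of CNOTs:

```python
import numpy as np

def ry(angle):
    """Single-qubit RY rotation matrix."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]])

def apply_1q(state, gate, qubit, n):
    """Apply a single-qubit gate to `qubit` of an n-qubit state vector."""
    psi = np.moveaxis(state.reshape([2] * n), qubit, 0)
    psi = np.tensordot(gate, psi, axes=([1], [0]))
    return np.moveaxis(psi, 0, qubit).reshape(-1)

def apply_cnot(state, control, target, n):
    """Apply CNOT: flip the target bit of every basis index whose control bit is 1."""
    out = state.copy()
    tmask = 1 << (n - 1 - target)
    for i in range(2 ** n):
        if (i >> (n - 1 - control)) & 1:
            out[i] = state[i ^ tmask]
    return out

def vqc_expectations(s, thetas, n=4):
    """Encode s via RY(s_q), apply layered RY(theta) + ring-CNOT blocks,
    and return <Z> on every wire (the classical features fed downstream)."""
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0
    for q in range(n):                       # encoding unitary U(s)
        state = apply_1q(state, ry(s[q]), q, n)
    for layer in thetas:                     # variational block V(theta)
        for q in range(n):
            state = apply_1q(state, ry(layer[q]), q, n)
        for q in range(n):
            state = apply_cnot(state, q, (q + 1) % n, n)
    probs = np.abs(state) ** 2               # measurement: <Z_q> from probabilities
    signs = lambda q: np.array([1 - 2 * ((i >> (n - 1 - q)) & 1)
                                for i in range(2 ** n)])
    return np.array([float(signs(q) @ probs) for q in range(n)])

print(vqc_expectations(np.zeros(4), np.zeros((1, 4))))  # all-|0> input: [1. 1. 1. 1.]
```

A 1-layer, 4-qubit block of this form has only 4 trainable angles, which is the structural source of the parameter savings the paper reports.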
Original abstract

Reinforcement learning is one of the most challenging learning paradigms where efficacy and efficiency gains are extremely valuable. Hierarchical reinforcement learning is a variant that leverages temporal abstraction to structure decision-making. While parametrized quantum computations have shown success in non-hierarchical reinforcement learning, whether these advantages adapt to hierarchical decision-making remains a critical open question. In this work, we develop a hybrid hierarchical agent based on the option-critic architecture. This hybrid agent substitutes classical components with variational quantum circuits for feature extractors, option-value functions, termination functions, and intra-option policies. Evaluated on standard benchmarking environments, results show that a hybrid agent utilizing a quantum feature extractor can outperform classical baselines while saving up to 66% trainable parameters. We also identify an architectural bottleneck that quantum option-value estimation severely degrades performance. Further ablation studies reveal how architectural choices of the quantum circuits affect performance. Our work establishes design principles for parameter-efficient hybrid hierarchical agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript develops a hybrid hierarchical reinforcement learning agent based on the option-critic architecture, replacing classical components (feature extractors, option-value functions, termination functions, and intra-option policies) with variational quantum circuits. On standard benchmarking environments, it reports that the variant using a quantum feature extractor outperforms classical baselines while using up to 66% fewer trainable parameters. It further identifies quantum option-value estimation as an architectural bottleneck and provides ablation studies on quantum circuit design choices.

Significance. If the empirical claims are substantiated with rigorous statistics and the simulation results prove representative of hardware, the work would usefully extend quantum advantages to hierarchical RL and supply concrete design principles for parameter-efficient hybrid agents. The explicit identification of the option-value bottleneck is a constructive contribution that can guide future architectures. The parameter-savings result, if robust, addresses a key practical constraint for near-term quantum devices.

major comments (3)
  1. [Abstract] Abstract and experimental evaluation: the central claim of outperformance over classical baselines is presented without error bars, number of independent runs, statistical significance tests, or full ablation depth. This directly undermines verification of the reported gains and the 66% parameter reduction.
  2. [Results] Experimental results (implied throughout): all evaluations rely on noiseless classical simulation of the variational quantum circuits. The manuscript does not model or mitigate gate infidelity, decoherence, or readout noise, which are known to alter loss landscapes and output distributions on near-term hardware and can erase both the performance edge and the parameter advantage once error-mitigation overhead is included.
  3. [Methods] Methods and experimental protocol: the full training protocol, environment details, data-handling rules, and exact definition of the 66% parameter savings (which components, which baseline) are absent, preventing reproduction or assessment of whether the hybrid agent truly delivers the claimed efficiency.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it named the specific environments and the precise architectural configuration that achieves the 66% savings.
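The statistical testing the referee asks for in comment 1 is straightforward to make concrete: with 10 seeds per configuration, a paired t-test across seeds is the natural check. A minimal stdlib sketch with illustrative reward numbers (not the paper's data):

```python
import math
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic and degrees of freedom for per-seed scores
    xs (hybrid) vs ys (classical), paired by random seed."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n)), n - 1

# Illustrative per-seed mean episodic rewards (hypothetical values).
hybrid    = [478, 465, 490, 455, 482, 470, 461, 488, 475, 468]
classical = [430, 445, 420, 450, 415, 440, 425, 435, 428, 442]

t, df = paired_t(hybrid, classical)
T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t, df = 9
print(t > T_CRIT_DF9)
```

Pairing by seed matters here: both agents see the same environment randomness, so the paired test removes seed-to-seed variance that an unpaired comparison would absorb as noise.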

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to improve statistical reporting, clarify methodological details, and explicitly discuss simulation limitations. Point-by-point responses follow.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental evaluation: the central claim of outperformance over classical baselines is presented without error bars, number of independent runs, statistical significance tests, or full ablation depth. This directly undermines verification of the reported gains and the 66% parameter reduction.

    Authors: We agree that the abstract should convey experimental rigor more clearly. The revised abstract now states that performance metrics are averaged over 10 independent random seeds with error bars shown as standard deviation, and notes that outperformance was assessed for statistical significance using paired t-tests (p < 0.05). The main text already contains extensive ablation studies on circuit depth, entanglement structure, and ansatz variants; we have added a summary table of these results to the abstract for completeness. The 66% parameter reduction is defined as the relative decrease in total trainable parameters when the classical feature extractor is replaced by the variational quantum circuit while keeping all other option-critic components fixed. revision: yes

  2. Referee: [Results] Experimental results (implied throughout): all evaluations rely on noiseless classical simulation of the variational quantum circuits. The manuscript does not model or mitigate gate infidelity, decoherence, or readout noise, which are known to alter loss landscapes and output distributions on near-term hardware and can erase both the performance edge and the parameter advantage once error-mitigation overhead is included.

    Authors: We acknowledge that all reported results use noiseless simulators, which is a standard first step for exploring hybrid quantum-classical algorithms. In the revised manuscript we have added a dedicated Limitations and Future Work subsection that (i) explicitly states the noiseless assumption, (ii) summarizes how depolarizing and readout noise typically affect variational quantum circuit outputs in RL settings, and (iii) references error-mitigation techniques (zero-noise extrapolation, probabilistic error cancellation) that could be applied in follow-up hardware studies. We also include a short sensitivity analysis under simulated moderate noise to illustrate that the parameter-efficiency advantage is not immediately erased. Full hardware benchmarking lies beyond the scope of the present work. revision: partial

  3. Referee: [Methods] Methods and experimental protocol: the full training protocol, environment details, data-handling rules, and exact definition of the 66% parameter savings (which components, which baseline) are absent, preventing reproduction or assessment of whether the hybrid agent truly delivers the claimed efficiency.

    Authors: We apologize for the insufficient detail. The revised Methods section now provides: complete hyperparameter tables (learning rates, discount factors, option durations, Adam optimizer settings, batch sizes, and total training steps); exact environment specifications and versions (e.g., Gymnasium 0.29 environments used); experience-replay buffer sizes and sampling rules; and the precise definition of parameter savings, computed as (P_classical - P_quantum)/P_classical where P_classical is the number of trainable parameters in the classical feature extractor baseline and P_quantum is the count for the variational quantum circuit variant. Pseudocode for the full training loop and a link to open-source code (to be released upon acceptance) have also been added. revision: yes
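The savings definition in response 3, (P_classical − P_quantum) / P_classical, is easy to operationalize. The layer sizes below are illustrative stand-ins, not the paper's exact configurations:

```python
def mlp_params(sizes):
    """Trainable parameters of a fully connected MLP (weights + biases)."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

def vqc_params(n_qubits, n_layers, rot_per_qubit=1):
    """Rotation angles in a layered VQC ansatz (CNOTs carry no parameters)."""
    return n_qubits * n_layers * rot_per_qubit

# Hypothetical sizes: 4-dim observation, two 24-unit hidden layers
# for the classical feature extractor vs. a 4-qubit, 5-layer VQC
# followed by a small classical readout layer.
p_classical = mlp_params([4, 24, 24])
p_quantum   = vqc_params(4, 5) + mlp_params([4, 24])

savings = (p_classical - p_quantum) / p_classical
print(f"savings = {savings:.0%}")
```

The exact percentage depends entirely on the architectures compared, which is why the referee's request to pin down "which components, which baseline" is essential for the 66% figure to be reproducible.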

Circularity Check

0 steps flagged

No circularity; results are direct empirical comparisons on standard benchmarks

Full rationale

The paper reports performance and parameter counts from explicit evaluations of a hybrid option-critic agent on standard RL environments, with ablations identifying bottlenecks such as quantum option-value estimation. No equations, derivations, or first-principles claims are presented that reduce the reported gains to fitted parameters, self-definitions, or self-citation chains. The central results rest on observable simulation outputs rather than any renaming, smuggling of ansatzes, or uniqueness theorems imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond standard assumptions of variational quantum circuit trainability and benchmark environment validity.

pith-pipeline@v0.9.0 · 5456 in / 1002 out tokens · 80706 ms · 2026-05-07T17:16:55.021662+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 18 canonical work pages · 2 internal anchors

  1. [1]

    R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018

  2. [2]

    Mastering the game of go without human knowledge,

    D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y . Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017. [Online]. Available: https://www.nature.com/artic...

  3. [3]

    Mastering atari, go, chess and shogi by planning with a learned model,

    J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, “Mastering atari, go, chess and shogi by planning with a learned model,”Nature, vol. 588, no. 7839, pp. 604–609, Dec 2020. [Online]. Available: https://doi.org/10.1038/s41586-020-03051-4

  4. [4]

    Hindsight experience replay,

    M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba, “Hindsight experience replay,” inAdvances in Neural Information Processing Systems, I. Guyon, U. V . Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online...

  5. [5]

    Hierarchical reinforcement learning: A comprehensive survey,

    S. Pateria, B. Subagdja, A.-h. Tan, and C. Quek, “Hierarchical reinforcement learning: A comprehensive survey,”ACM Comput. Surv., vol. 54, no. 5, Jun. 2021. [Online]. Available: https: //doi.org/10.1145/3453160

  6. [6]

    Reinforcement learning with hierarchies of machines,

    R. Parr and S. Russell, “Reinforcement learning with hierarchies of machines,” in Advances in Neural Information Processing Systems, M. Jordan, M. Kearns, and S. Solla, Eds., vol. 10. MIT Press, 1997. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/1997/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf

  7. [7]

    Hierarchical reinforcement learning with the maxq value function decomposition,

    T. G. Dietterich, “Hierarchical reinforcement learning with the maxq value function decomposition,”J. Artif. Int. Res., vol. 13, no. 1, p. 227–303, Nov. 2000

  8. [8]

    Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,

    R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,”Artificial Intelligence, vol. 112, no. 1, pp. 181–211, 1999. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0004370299000521

  9. [9]

    The option-critic architecture,

    P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, Feb. 2017. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/10916

  10. [10]

    A survey on quantum reinforcement learning,

    N. Meyer, C. Ufrecht, M. Periyasamy, D. D. Scherer, A. Plinge, and C. Mutschler, “A survey on quantum reinforcement learning,” 2024. [Online]. Available: https://arxiv.org/abs/2211.03464

  11. [11]

    Reinforcement learning with quantum variational circuit,

    O. Lockwood and M. Si, “Reinforcement learning with quantum variational circuit,” Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 16, no. 1, pp. 245–251, Oct. 2020. [Online]. Available: https://ojs.aaai.org/index.php/AIIDE/article/view/7437

  12. [12]

    Parametrized quantum policies for reinforcement learning,

    S. Jerbi, C. Gyurik, S. Marshall, H. Briegel, and V . Dunjko, “Parametrized quantum policies for reinforcement learning,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 28 362–28 375. [Online]. Available: https://proceedings.neuri...

  13. [13]

    Quantum advantage actor-critic for reinforcement learning,

    M. Kölle, M. Hgog, F. Ritz, P. Altmann, M. Zorn, J. Stein, and C. Linnhoff-Popien, “Quantum advantage actor-critic for reinforcement learning,” in Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, INSTICC. SciTePress, 2024, pp. 297–304. [Online]. Available: https://www.scitepress.org/Papers/2024/1...

  14. [14]

    Variational Quantum Circuits for Deep Reinforcement Learning,

    S. Y .-C. Chen, C.-H. H. Yang, J. Qi, P.-Y . Chen, X. Ma, and H.-S. Goan, “Variational Quantum Circuits for Deep Reinforcement Learning,”IEEE Access, vol. 8, pp. 141 007–141 024, 2020

  15. [15]

    Introduction to Quantum Reinforcement Learning: Theory and PennyLane-based Implementation,

    Y. Kwak, W. J. Yun, S. Jung, J.-K. Kim, and J. Kim, “Introduction to Quantum Reinforcement Learning: Theory and PennyLane-based Implementation,” in 12th International Conference on Information and Communication Technology Convergence, Oct. 2021

  16. [16]

    Expressibility and entangling capability of parameter- ized quantum circuits for hybrid quantum-classical al- gorithms,

    S. Sim, P. D. Johnson, and A. Aspuru-Guzik, “Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms,” Advanced Quantum Technologies, vol. 2, no. 12, p. 1900070, 2019. [Online]. Available: https://advanced.onlinelibrary.wiley.com/doi/abs/10.1002/qute.201900070

  17. [17]

    Q-learning,

    C. J. C. H. Watkins and P. Dayan, “Q-learning,”Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992698

  18. [18]

    Human-level control through deep reinforcement learning,

    V . Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,”Nature, vol. 518, no. 7540, pp. 529–533, 2015

  19. [19]

    Deep reinforcement learning with double q-learning,

    H. v. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” inProceedings of the Thirtieth AAAI Conference on Artificial Intelligence, ser. AAAI’16. AAAI Press, 2016, p. 2094–2100

  20. [20]

    Dueling network architectures for deep reinforcement learning,

    Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” inProceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ser. ICML’16. JMLR.org, 2016, p. 1995–2003

  21. [21]

    Rainbow: Combining improvements in deep reinforcement learning,

    M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, Apr

  22. [22]

    Available: https://ojs.aaai.org/index.php/AAAI/article/view/11796

    [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/11796

  23. [23]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning,

    R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3–4, p. 229–256, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992696

  24. [24]

    Continuous control with deep reinforcement learning,

    T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y . Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” inInternational Conference on Learning Representations (ICLR), 2016

  25. [25]

    Asynchronous methods for deep reinforcement learning,

    V . Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” inProceedings of The 33rd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. F. Balcan and K. Q. Weinberger, Eds., vol. 48. New York, New York, USA: PMLR, 2...

  26. [26]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” inProceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1861–1870. [Online]. Availa...

  27. [27]

    Addressing function approximation error in actor-critic methods,

    S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” inProceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1587–1596. [Online]. Available: https://proceedings.mlr.press/v80/fuj...

  28. [28]

    Star: Bootstrapping reasoning with reasoning,

    E. Zelikman, Y . Wu, J. Mu, and N. Goodman, “Star: Bootstrapping reasoning with reasoning,” inAdvances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 15 476–15 488. [Online]. Available: https://proceedings.neurips.cc/paper files/paper/202 2/file...

  29. [29]

    Rl-star: Theoretical analysis of reinforcement learning frameworks for self-taught reasoner,

    F.-C. Chang, Y .-T. Lee, H.-Y . Shih, Y . H. Tseng, and P.-Y . Wu, “Rl-star: Theoretical analysis of reinforcement learning frameworks for self-taught reasoner,” 2025. [Online]. Available: https://arxiv.org/abs/2410.23912

  30. [30]

    Learning options in reinforcement learning,

    M. Stolle and D. Precup, “Learning options in reinforcement learning,” inAbstraction, Reformulation, and Approximation, S. Koenig and R. C. Holte, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 212–223

  31. [31]

    Strategic attentive writer for learning macro-actions,

    A. S. Vezhnevets, V . Mnih, J. Agapiou, S. Osindero, A. Graves, O. Vinyals, and K. Kavukcuoglu, “Strategic attentive writer for learning macro-actions,” inProceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. Red Hook, NY , USA: Curran Associates Inc., 2016, p. 3494–3502

  32. [32]

    Dac: The double actor-critic architecture for learning options,

    S. Zhang and S. Whiteson, “Dac: The double actor-critic architecture for learning options,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/...

  33. [33]

    FeUdal networks for hierarchical reinforcement learning,

    A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “FeUdal networks for hierarchical reinforcement learning,” inProceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y . W. Teh, Eds., vol. 70. PMLR, 06–11 Aug 2017, pp. 3540–3549. [Onl...

  34. [34]

    Data-efficient hierarchical reinforcement learning,

    O. Nachum, S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” inProceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18. Red Hook, NY , USA: Curran Associates Inc., 2018, p. 3307–3317

  35. [35]

    Why does hierarchy (sometimes) work so well in reinforcement learning?

    O. Nachum, H. Tang, X. Lu, S. Gu, H. Lee, and S. Levine, “Why does hierarchy (sometimes) work so well in reinforcement learning?” 2019. [Online]. Available: https://arxiv.org/abs/1909.10618

  36. [36]

    Quantum agents in the Gym: a variational quantum algorithm for deep Q-learning,

    A. Skolik, S. Jerbi, and V . Dunjko, “Quantum agents in the Gym: a variational quantum algorithm for deep Q-learning,”Quantum, vol. 6, p. 720, May 2022. [Online]. Available: https://doi.org/10.22331/q-2022-0 5-24-720

  37. [37]

    Quantum reinforcement learning,

    D. Dong, C. Chen, H. Li, and T.-J. Tarn, “Quantum reinforcement learning,”Trans. Sys. Man Cyber. Part B, vol. 38, no. 5, p. 1207–1220, Oct

  38. [38]

    Available: https://doi.org/10.1109/TSMCB.2008.925743

    [Online]. Available: https://doi.org/10.1109/TSMCB.2008.925743

  39. [39]

    Data re-uploading for a universal quantum classifier,

    A. Pérez-Salinas, A. Cervera-Lierta, E. Gil-Fuster, and J. I. Latorre, “Data re-uploading for a universal quantum classifier,” Quantum, vol. 4, p. 226, Feb. 2020. [Online]. Available: https://doi.org/10.22331/q-2020-02-06-226

  40. [40]

    Effect of data encoding on the expressive power of variational quantum-machine-learning models,

    M. Schuld, R. Sweke, and J. J. Meyer, “Effect of data encoding on the expressive power of variational quantum-machine-learning models,” Phys. Rev. A, vol. 103, p. 032430, Mar 2021. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.103.032430

  41. [41]

    Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets,

    A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, “Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets,”Nature, vol. 549, no. 7671, pp. 242–246, Sep 2017. [Online]. Available: https://doi.org/10.1038/nature23879

  42. [42]

    Asynchronous training of quantum reinforcement learning,

    S. Y .-C. Chen, “Asynchronous training of quantum reinforcement learning,”Procedia Computer Science, vol. 222, pp. 321–330, 2023, international Neural Network Society Workshop on Deep Learning Innovations and Applications (INNS DLIA 2023). [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050923009365

  43. [43]

    Generalization in quantum machine learning from few training data,

    M. C. Caro, H.-Y . Huang, M. Cerezo, K. Sharma, A. Sornborger, L. Cincio, and P. J. Coles, “Generalization in quantum machine learning from few training data,”Nature Communications, vol. 13, no. 1, p. 4919, Aug 2022. [Online]. Available: https://doi.org/10.1038/s41467-022 -32550-3

  44. [44]

    Efficient relation extraction via quantum reinforcement learning,

    X. Zhu, Y . Mu, X. Wang, and W. Zhu, “Efficient relation extraction via quantum reinforcement learning,”Complex & Intelligent Systems, vol. 10, p. 4009–4018, Feb. 2024

  45. [45]

    Barren plateaus in quantum neural network training landscapes,

    J. R. McClean, S. Boixo, V . N. Smelyanskiy, R. Babbush, and H. Neven, “Barren plateaus in quantum neural network training landscapes,”Nature communications, vol. 9, no. 1, p. 4812, 2018

  46. [46]

    Robustness of quantum reinforcement learning under hardware errors,

    A. Skolik, S. Mangini, T. Bäck, C. Macchiavello, and V. Dunjko, “Robustness of quantum reinforcement learning under hardware errors,” EPJ Quantum Technology, vol. 10, no. 1, p. 8, Feb 2023. [Online]. Available: https://doi.org/10.1140/epjqt/s40507-023-00166-1

  47. [47]

    Trainability issues in quantum policy gradients,

    A. Sequeira, L. Paulo Santos, and L. Soares Barbosa, “Trainability issues in quantum policy gradients,”Machine Learning: Science and Technology, vol. 5, no. 3, p. 035037, aug 2024. [Online]. Available: https://doi.org/10.1088/2632-2153/ad6830

  48. [48]

    Policy gradient methods for reinforcement learning with function approximation,

    R. S. Sutton, D. McAllester, S. Singh, and Y . Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Proceedings of the 13th International Conference on Neural Information Processing Systems, 2000, pp. 1057–1063

  49. [49]

    PennyLane: Automatic differentiation of hybrid quantum-classical computations

    V . Bergholm, J. Izaac, M. Schuld, C. Gogolin, S. Ahmed, V . Ajith, M. S. Alam, G. Alonso-Linaje, B. AkashNarayanan, A. Asadi, J. M. Arrazola, U. Azad, S. Banning, C. Blank, T. R. Bromley, B. A. Cordier, J. Ceroni, A. Delgado, O. D. Matteo, A. Dusko, T. Garg, D. Guala, A. Hayes, R. Hill, A. Ijaz, T. Isacsson, D. Ittah, S. Jahangiri, P. Jain, E. Jiang, A. ...

  50. [50]

    Gymnasium: A standard interface for reinforcement learning environments,

    M. Towers, A. Kwiatkowski, J. U. Balis, G. D. Cola, T. Deleu, M. Goulão, K. Andreas, M. Krimmel, A. KG, R. D. L. Perez-Vicente, J. K. Terry, A. Pierré, S. V. Schulhoff, J. J. Tai, H. Tan, and O. G. Younis, “Gymnasium: A standard interface for reinforcement learning environments,” in The Thirty-ninth Annual Conference on Neural Information Processing S...

  51. [51]

    Adam: A method for stochastic optimization,

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014. [Online]. Available: https://arxiv.org/abs/1412.6980


  53. [53]

    Learning to program quantum measurements for machine learning,

    S. Y .-C. Chen, H.-H. Tseng, H.-Y . Lin, and S. Yoo, “Learning to program quantum measurements for machine learning,” in2025 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 01, 2025, pp. 1826–1836