Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization

Ali Shah Ali; Fawad Javed Fateh; M. Shaheer Luqman; Murad Popattia; M. Zeeshan Zia; Quoc-Huy Tran; Usman Nizamani

arxiv: 2605.30928 · v1 · pith:CCHQBM7Xnew · submitted 2026-05-29 · 💻 cs.RO

Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization

Usman Nizamani , M. Shaheer Luqman , Fawad Javed Fateh , Ali Shah Ali , Murad Popattia , M. Zeeshan Zia , Quoc-Huy Tran This is my paper

Pith reviewed 2026-06-28 22:27 UTC · model grok-4.3

classification 💻 cs.RO

keywords reinforcement learninghuman-like agentsvector quantizationmacro actionsD4RLhierarchical learning

0 comments

The pith

Two levels of vector quantization turn human action sequences into macro actions that let RL agents match human behavior more closely while keeping high task success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HiMAQ, which applies two successive vector quantizations to human demonstrations so that lower-level clusters capture fine subactions and higher-level clusters group them into reusable macro actions. Reinforcement learning agents then learn policies over these macro actions instead of raw actions, producing trajectories that score higher on human-likeness metrics. On D4RL benchmark tasks the hierarchical version beats the flat MAQ baseline on human-likeness while matching or exceeding success rates, and the gains hold when the same structure is plugged into IQL, SAC, or RLPD.

Core claim

Encoding human demonstrations into macro actions via two successive levels of vector quantization enables RL agents to generate behaviors that align more closely with humans while still maximizing rewards, and this hierarchical approach improves human-likeness scores over single-level quantization without reducing task performance.

What carries the argument

HiMAQ, a two-level hierarchical macro action quantization in which the lower level maps actions to fine-grained subaction clusters and the higher level aggregates those clusters into macro action clusters.

If this is right

Agents using the two-level structure outperform the non-hierarchical MAQ baseline on human-likeness while keeping comparable or better success rates.
The same hierarchical quantization integrates successfully with IQL, SAC, and RLPD and delivers the human-likeness gains across all three.
The resulting policies remain competitive with earlier RL agents on the same D4RL tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the macro actions capture reusable human movement patterns they could transfer to new tasks without retraining the quantization layers.
Adding a third quantization level might further improve alignment when human behavior contains distinct sub-phases within a macro action.
The same two-level structure could be tested on real robot hardware to check whether the human-likeness gains reduce the need for post-hoc safety filters.

Load-bearing premise

The human-likeness scores computed on D4RL data actually measure behavioral similarity to humans and the two-level structure is what produces the measured gains rather than other implementation choices.

What would settle it

Retraining the identical agents on the same data after collapsing the two quantization levels into one while keeping every other detail fixed and finding no drop in human-likeness scores would falsify the claim that the hierarchy itself is responsible.

Figures

Figures reproduced from arXiv: 2605.30928 by Ali Shah Ali, Fawad Javed Fateh, M. Shaheer Luqman, Murad Popattia, M. Zeeshan Zia, Quoc-Huy Tran, Usman Nizamani.

**Figure 2.** Figure 2: Impacts of different macro action lengths [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Turing test win rates (i.e., fraction of trials where evaluators mistook an agent for a human). [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Human-likeness ranking test pairwise win rates. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Human-like agents are a long-standing goal of artificial intelligence. Despite strong performance, most reinforcement learning (RL) agents remain reward-driven and often exhibit behaviors that differ from humans, limiting interpretability and reliability. In this work, we introduce a novel human-like RL framework that predicts action sequences closely aligned with human behaviors while maximizing rewards. Specifically, we encode human demonstrations into macro actions using a hierarchical macro action quantization approach (termed HiMAQ) consisting of two successive levels of vector quantization. The lower quantization level maps input actions to fine-grained subaction clusters, while the higher quantization level aggregates these subaction clusters into action clusters. Extensive evaluations on the D4RL benchmarks show that our hierarchical approach outperforms the non-hierarchical baseline (MAQ), achieving better human-likeness scores while maintaining comparable or better success rates than previous RL agents. The improvements generalize across integrations with various RL algorithms, namely IQL, SAC, and RLPD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HiMAQ is a direct two-level VQ extension of the authors' own MAQ, but the abstract supplies no metric definition or ablations, leaving the hierarchy's causal role unproven.

read the letter

The one thing to know is that HiMAQ is basically MAQ with an extra quantization layer on top, and the abstract does not give enough to judge whether that extra layer is what drives any improvement in human-likeness.

The paper introduces a two-stage quantization: first clustering actions into subactions, then clustering those into macro actions. This is presented as a way to make RL agents behave more like humans by learning from demos in this structured way. It reports better human-likeness than the flat MAQ baseline on D4RL tasks, with success rates that are at least as good, and shows this holds when plugged into three different RL methods.

That integration across algorithms is a positive point, as is sticking to public benchmarks. The approach is simple and builds directly on their prior work without introducing exotic new components.

Where it falls short is in the supporting evidence. There is no explanation of how human-likeness is measured, no statistical significance reported, and no experiments that isolate the effect of the hierarchy from other changes like codebook sizes or training procedures. If the metric turns out to be something basic like distribution matching on actions, or if a non-hierarchical version with matched capacity performs similarly, then the central selling point does not hold. The stress-test concern about causality is valid based on the abstract alone.

This kind of paper is aimed at researchers in robotics and RL who want agents that look more natural. It could be useful for someone already working on macro actions or imitation in offline settings.

I would recommend sending it for peer review. The idea is clear enough that referees could ask for the necessary ablations and clarifications, and it might turn into a solid incremental result if those are addressed.

Referee Report

3 major / 2 minor

Summary. The paper introduces HiMAQ, a two-level hierarchical vector quantization method (subaction clusters followed by action clusters) to encode human demonstrations into macro actions for RL agents. It claims this yields better human-likeness scores than the non-hierarchical MAQ baseline on D4RL benchmarks while preserving or improving success rates, with the gains generalizing when the macro actions are integrated into IQL, SAC, and RLPD.

Significance. If the hierarchy is shown to be the causal factor and the human-likeness metric is validated as capturing behavioral similarity, the framework could offer a practical route to more interpretable RL policies that align with human behavior on offline benchmarks.

major comments (3)

[Abstract and Experiments] Abstract and §4 (Experiments): the human-likeness metric is never defined, nor is any validation provided that it measures behavioral similarity to humans rather than a simple distributional distance; without this, the central outperformance claim cannot be assessed.
[Method and Ablations] §3 (Method) and §4.2 (Ablations): no ablation isolates the two successive VQ levels from capacity or training differences; a single-level VQ with matched total codebook size could produce equivalent scores, so the attribution of gains to hierarchy is unsupported.
[Integration and Experiments] §3.3 (Integration) and §4.3: the paper states that HiMAQ generalizes across IQL/SAC/RLPD but supplies no description of how the quantized macro actions are actually inserted into each algorithm's policy or replay buffer, leaving open whether observed differences arise from the hierarchy or from unstated implementation choices.

minor comments (2)

[Abstract] Abstract: numerical human-likeness and success-rate values, together with standard errors or statistical tests, are omitted.
[Method] Notation in §3: the distinction between the lower-level subaction codebook and the higher-level action codebook is introduced without an explicit equation relating the two quantization steps.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will make the indicated revisions to improve clarity and rigor.

read point-by-point responses

Referee: [Abstract and Experiments] Abstract and §4 (Experiments): the human-likeness metric is never defined, nor is any validation provided that it measures behavioral similarity to humans rather than a simple distributional distance; without this, the central outperformance claim cannot be assessed.

Authors: We agree that the human-likeness metric requires an explicit definition and validation within the main text. In the revised manuscript we will add a dedicated subsection (new §3.4) that formally defines the metric (including its computation from demonstration trajectories) and provide supporting validation experiments that compare it against direct behavioral similarity measures such as trajectory overlap and human preference ratings. These additions will be placed before the experimental results to ensure the central claims can be properly assessed. revision: yes
Referee: [Method and Ablations] §3 (Method) and §4.2 (Ablations): no ablation isolates the two successive VQ levels from capacity or training differences; a single-level VQ with matched total codebook size could produce equivalent scores, so the attribution of gains to hierarchy is unsupported.

Authors: We acknowledge the need for an ablation that controls for total codebook capacity. We will add a new experiment in §4.2 that compares HiMAQ against a single-level vector quantization baseline whose codebook size equals the product of the two HiMAQ levels. This will isolate whether the observed gains stem from the hierarchical structure rather than increased representational capacity. revision: yes
Referee: [Integration and Experiments] §3.3 (Integration) and §4.3: the paper states that HiMAQ generalizes across IQL/SAC/RLPD but supplies no description of how the quantized macro actions are actually inserted into each algorithm's policy or replay buffer, leaving open whether observed differences arise from the hierarchy or from unstated implementation choices.

Authors: We will expand §3.3 with explicit pseudocode and textual descriptions of the integration procedure for each algorithm. The revised section will detail how macro-action sequences are encoded into the policy input, how they are sampled during training, and how they are stored and retrieved from the replay buffer for IQL, SAC, and RLPD respectively. This will eliminate ambiguity regarding implementation choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison with no derivational reduction

full rationale

The paper introduces HiMAQ as a two-level vector quantization method on human demonstrations and reports empirical outperformance versus the non-hierarchical MAQ baseline on D4RL success rates and human-likeness scores, with generalization across IQL/SAC/RLPD. No equations, derivations, or first-principles claims appear; the central result is a set of benchmark numbers whose validity rests on external data and controls rather than any quantity defined inside the paper being renamed or fitted as a prediction. Self-citation of MAQ (if present) is not load-bearing for a mathematical result because the evaluation is comparative and falsifiable on public benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, no parameter counts, and no explicit modeling assumptions. Free parameters, axioms, and invented entities cannot be extracted.

pith-pipeline@v0.9.1-grok · 5728 in / 1353 out tokens · 25217 ms · 2026-06-28T22:27:46.399992+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 5 canonical work pages · 4 internal anchors

[1]

Schrittwieser, I

J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lock- hart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering Atari, Go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, Dec. 2020

2020
[2]

Vinyals, I

O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V . Dalibard, D. Budden, Y . Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff,...

2019
[3]

Berner, G

OpenAI, C. Berner, G. Brockman, B. Chan, V . Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. d. O. Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang. Dota 2 with Large Scale Deep Reinforcement Learni...

2019
[4]

Hafner, J

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse control tasks through world models.Nature, pages 1–7, Apr. 2025

2025
[5]

J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning, Feb. 2021

2021
[6]

P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. Efficient Online Reinforcement Learning with Offline Data. InProceedings of the 40th International Conference on Machine Learning, pages 1577–1594. PMLR, July 2023

2023
[7]

Mysore, B

S. Mysore, B. Mabsout, R. Mancuso, and K. Saenko. Regularizing Action Policies for Smooth Control with Reinforcement Learning. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1810–1816. IEEE Press, May 2021

2021
[8]

Lee, H.-G

I. Lee, H.-G. Cao, C.-T. Dao, Y .-C. Chen, and I.-C. Wu. Gradient-based Regularization for Action Smoothness in Robotic Control with Reinforcement Learning. In2024 IEEE/RSJ In- ternational Conference on Intelligent Robots and Systems (IROS), pages 603–610, Oct. 2024

2024
[9]

Q. Shen, Y . Li, H. Jiang, Z. Wang, and T. Zhao. Deep Reinforcement Learning with Robust and Smooth Policy. InProceedings of the 37th International Conference on Machine Learning, pages 8707–8718. PMLR, Nov. 2020

2020
[10]

W. Koch. Flight Controller Synthesis Via Deep Reinforcement Learning, Sept. 2019

2019
[11]

Milani, A

S. Milani, A. Juliani, I. Momennejad, R. Georgescu, J. Rzepecki, A. Shaw, G. Costello, F. Fang, S. Devlin, and K. Hofmann. Navigates Like Me: Understanding How People Evaluate Human- Like AI in Video Games. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, pages 1–18. Association for Computing Machinery, Apr. 2023

2023
[12]

Devlin, R

S. Devlin, R. Georgescu, I. Momennejad, J. Rzepecki, E. Zuniga, G. Costello, G. Leroy, A. Shaw, and K. Hofmann. Navigation Turing Test (NTT): Learning to Evaluate Human- Like Navigation. InProceedings of the 38th International Conference on Machine Learning, pages 2644–2653. PMLR, July 2021. 10

2021
[13]

Zuniga, S

E. Zuniga, S. Milani, G. Leroy, J. Rzepecki, R. Georgescu, I. Momennejad, D. Bignell, M. Sun, A. Shaw, G. Costello, M. Jacob, S. Devlin, and K. Hofmann. How Humans Perceive Human- like Behavior in Video Game Navigation. InExtended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, CHI EA ’22, pages 1–11. Association for Comput- in...

2022
[14]

Ho, P.-C

K.-H. Ho, P.-C. Hsieh, C.-C. Lin, Y .-R. Lou, F.-J. Wang, and I.-C. Wu. Towards Human- Like RL: Taming Non-Naturalistic Behavior in Deep RL via Adaptive Behavioral Costs in 3D Games. InProceedings of the 15th Asian Conference on Machine Learning, pages 438–453. PMLR, Feb. 2024

2024
[15]

Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955, 2022

N. Hansen, X. Wang, and H. Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022

work page arXiv 2022
[16]

Guo, Y .-C

J.-T. Guo, Y .-C. Chen, P.-C. Hsieh, K.-H. Ho, P.-W. Huang, T.-R. Wu, I. Wu, et al. Learning human-like rl agents through trajectory optimization with action quantization.Advances in Neural Information Processing Systems, 38:83534–83565, 2026

2026
[17]

Van Den Oord, O

A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

2017
[18]

Spurio, E

F. Spurio, E. Bahrami, G. Francesca, and J. Gall. Hierarchical vector quantization for unsuper- vised action segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6996–7005, 2025

2025
[19]

Kostrikov, A

I. Kostrikov, A. Nair, and S. Levine. Offline Reinforcement Learning with Implicit Q-Learning. InInternational Conference on Learning Representations, Oct. 2021

2021
[20]

Haarnoja, A

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. InProceedings of the 35th International Conference on Machine Learning, pages 1861–1870. PMLR, July 2018

2018
[21]

Fujii, Y

N. Fujii, Y . Sato, H. Wakama, K. Kazai, and H. Katayose. Evaluating human-like behaviors of video-game agents autonomously acquired with biological constraints. InInternational Conference on Advances in Computer Entertainment Technology, pages 61–76. Springer, 2013

2013
[22]

Ho, P.-C

K.-H. Ho, P.-C. Hsieh, C.-C. Lin, Y .-R. Lou, F.-J. Wang, and I.-C. Wu. Towards human-like rl: Taming non-naturalistic behavior in deep rl via adaptive behavioral costs in 3d games. In Asian Conference on Machine Learning, pages 438–453. PMLR, 2024

2024
[23]

Hester, M

T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep q-learning from demonstrations. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[24]

End to End Learning for Self-Driving Cars

M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[25]

M. Zare, P. M. Kebria, A. Khosravi, and S. Nahavandi. A survey of imitation learning: Al- gorithms, recent developments, and challenges.IEEE Transactions on Cybernetics, 54(12): 7173–7186, 2024

2024
[26]

Arora and P

S. Arora and P. Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress.Artificial Intelligence, 297:103500, 2021

2021
[27]

I. P. Durugkar, C. Rosenbaum, S. Dernbach, and S. Mahadevan. Deep reinforcement learning with macro-actions.arXiv preprint arXiv:1606.04615, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[28]

D. P. Kingma and M. Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013. 11

work page internal anchor Pith review Pith/arXiv arXiv 2013
[29]

Ozair, Y

S. Ozair, Y . Li, A. Razavi, I. Antonoglou, A. Van Den Oord, and O. Vinyals. Vector quantized models for planning. Ininternational conference on machine learning, pages 8302–8313. PMLR, 2021

2021
[30]

Antonoglou, J

I. Antonoglou, J. Schrittwieser, S. Ozair, T. K. Hubert, and D. Silver. Planning in stochastic environments with a learned model. InInternational Conference on Learning Representations, 2021

2021
[31]

J. Luo, P. Dong, J. Wu, A. Kumar, X. Geng, and S. Levine. Action-quantized offline reinforce- ment learning for robotic skill learning. InConference on Robot Learning, pages 1348–1361. PMLR, 2023

2023
[32]

A. D. Vuong, M. N. Vu, D. An, and I. Reid. Action tokenizer matters in in-context imita- tion learning. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13490–13496. IEEE, 2025

2025
[33]

C. G. Atkeson and S. Schaal. Robot Learning From Demonstration. InProceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pages 12–20. Morgan Kaufmann Publishers Inc., July 1997

1997
[34]

H. Zhou, T. Wei, Z. Lin, j. li, J. Xing, Y . Shi, L. Shen, C. Yu, and D. Ye. Revisiting Discrete Soft Actor-Critic, Nov. 2024

2024
[35]

Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

A. Rajeswaran, V . Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstra- tions.arXiv preprint arXiv:1709.10087, 2017. 12

work page internal anchor Pith review Pith/arXiv arXiv 2017

[1] [1]

Schrittwieser, I

J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lock- hart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering Atari, Go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, Dec. 2020

2020

[2] [2]

Vinyals, I

O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V . Dalibard, D. Budden, Y . Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff,...

2019

[3] [3]

Berner, G

OpenAI, C. Berner, G. Brockman, B. Chan, V . Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. d. O. Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang. Dota 2 with Large Scale Deep Reinforcement Learni...

2019

[4] [4]

Hafner, J

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse control tasks through world models.Nature, pages 1–7, Apr. 2025

2025

[5] [5]

J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning, Feb. 2021

2021

[6] [6]

P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. Efficient Online Reinforcement Learning with Offline Data. InProceedings of the 40th International Conference on Machine Learning, pages 1577–1594. PMLR, July 2023

2023

[7] [7]

Mysore, B

S. Mysore, B. Mabsout, R. Mancuso, and K. Saenko. Regularizing Action Policies for Smooth Control with Reinforcement Learning. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1810–1816. IEEE Press, May 2021

2021

[8] [8]

Lee, H.-G

I. Lee, H.-G. Cao, C.-T. Dao, Y .-C. Chen, and I.-C. Wu. Gradient-based Regularization for Action Smoothness in Robotic Control with Reinforcement Learning. In2024 IEEE/RSJ In- ternational Conference on Intelligent Robots and Systems (IROS), pages 603–610, Oct. 2024

2024

[9] [9]

Q. Shen, Y . Li, H. Jiang, Z. Wang, and T. Zhao. Deep Reinforcement Learning with Robust and Smooth Policy. InProceedings of the 37th International Conference on Machine Learning, pages 8707–8718. PMLR, Nov. 2020

2020

[10] [10]

W. Koch. Flight Controller Synthesis Via Deep Reinforcement Learning, Sept. 2019

2019

[11] [11]

Milani, A

S. Milani, A. Juliani, I. Momennejad, R. Georgescu, J. Rzepecki, A. Shaw, G. Costello, F. Fang, S. Devlin, and K. Hofmann. Navigates Like Me: Understanding How People Evaluate Human- Like AI in Video Games. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, pages 1–18. Association for Computing Machinery, Apr. 2023

2023

[12] [12]

Devlin, R

S. Devlin, R. Georgescu, I. Momennejad, J. Rzepecki, E. Zuniga, G. Costello, G. Leroy, A. Shaw, and K. Hofmann. Navigation Turing Test (NTT): Learning to Evaluate Human- Like Navigation. InProceedings of the 38th International Conference on Machine Learning, pages 2644–2653. PMLR, July 2021. 10

2021

[13] [13]

Zuniga, S

E. Zuniga, S. Milani, G. Leroy, J. Rzepecki, R. Georgescu, I. Momennejad, D. Bignell, M. Sun, A. Shaw, G. Costello, M. Jacob, S. Devlin, and K. Hofmann. How Humans Perceive Human- like Behavior in Video Game Navigation. InExtended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, CHI EA ’22, pages 1–11. Association for Comput- in...

2022

[14] [14]

Ho, P.-C

K.-H. Ho, P.-C. Hsieh, C.-C. Lin, Y .-R. Lou, F.-J. Wang, and I.-C. Wu. Towards Human- Like RL: Taming Non-Naturalistic Behavior in Deep RL via Adaptive Behavioral Costs in 3D Games. InProceedings of the 15th Asian Conference on Machine Learning, pages 438–453. PMLR, Feb. 2024

2024

[15] [15]

Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955, 2022

N. Hansen, X. Wang, and H. Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022

work page arXiv 2022

[16] [16]

Guo, Y .-C

J.-T. Guo, Y .-C. Chen, P.-C. Hsieh, K.-H. Ho, P.-W. Huang, T.-R. Wu, I. Wu, et al. Learning human-like rl agents through trajectory optimization with action quantization.Advances in Neural Information Processing Systems, 38:83534–83565, 2026

2026

[17] [17]

Van Den Oord, O

A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

2017

[18] [18]

Spurio, E

F. Spurio, E. Bahrami, G. Francesca, and J. Gall. Hierarchical vector quantization for unsuper- vised action segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6996–7005, 2025

2025

[19] [19]

Kostrikov, A

I. Kostrikov, A. Nair, and S. Levine. Offline Reinforcement Learning with Implicit Q-Learning. InInternational Conference on Learning Representations, Oct. 2021

2021

[20] [20]

Haarnoja, A

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. InProceedings of the 35th International Conference on Machine Learning, pages 1861–1870. PMLR, July 2018

2018

[21] [21]

Fujii, Y

N. Fujii, Y . Sato, H. Wakama, K. Kazai, and H. Katayose. Evaluating human-like behaviors of video-game agents autonomously acquired with biological constraints. InInternational Conference on Advances in Computer Entertainment Technology, pages 61–76. Springer, 2013

2013

[22] [22]

Ho, P.-C

K.-H. Ho, P.-C. Hsieh, C.-C. Lin, Y .-R. Lou, F.-J. Wang, and I.-C. Wu. Towards human-like rl: Taming non-naturalistic behavior in deep rl via adaptive behavioral costs in 3d games. In Asian Conference on Machine Learning, pages 438–453. PMLR, 2024

2024

[23] [23]

Hester, M

T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep q-learning from demonstrations. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018

[24] [24]

End to End Learning for Self-Driving Cars

M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[25] [25]

M. Zare, P. M. Kebria, A. Khosravi, and S. Nahavandi. A survey of imitation learning: Al- gorithms, recent developments, and challenges.IEEE Transactions on Cybernetics, 54(12): 7173–7186, 2024

2024

[26] [26]

Arora and P

S. Arora and P. Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress.Artificial Intelligence, 297:103500, 2021

2021

[27] [27]

I. P. Durugkar, C. Rosenbaum, S. Dernbach, and S. Mahadevan. Deep reinforcement learning with macro-actions.arXiv preprint arXiv:1606.04615, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[28] [28]

D. P. Kingma and M. Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013. 11

work page internal anchor Pith review Pith/arXiv arXiv 2013

[29] [29]

Ozair, Y

S. Ozair, Y . Li, A. Razavi, I. Antonoglou, A. Van Den Oord, and O. Vinyals. Vector quantized models for planning. Ininternational conference on machine learning, pages 8302–8313. PMLR, 2021

2021

[30] [30]

Antonoglou, J

I. Antonoglou, J. Schrittwieser, S. Ozair, T. K. Hubert, and D. Silver. Planning in stochastic environments with a learned model. InInternational Conference on Learning Representations, 2021

2021

[31] [31]

J. Luo, P. Dong, J. Wu, A. Kumar, X. Geng, and S. Levine. Action-quantized offline reinforce- ment learning for robotic skill learning. InConference on Robot Learning, pages 1348–1361. PMLR, 2023

2023

[32] [32]

A. D. Vuong, M. N. Vu, D. An, and I. Reid. Action tokenizer matters in in-context imita- tion learning. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13490–13496. IEEE, 2025

2025

[33] [33]

C. G. Atkeson and S. Schaal. Robot Learning From Demonstration. InProceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pages 12–20. Morgan Kaufmann Publishers Inc., July 1997

1997

[34] [34]

H. Zhou, T. Wei, Z. Lin, j. li, J. Xing, Y . Shi, L. Shen, C. Yu, and D. Ye. Revisiting Discrete Soft Actor-Critic, Nov. 2024

2024

[35] [35]

Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

A. Rajeswaran, V . Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstra- tions.arXiv preprint arXiv:1709.10087, 2017. 12

work page internal anchor Pith review Pith/arXiv arXiv 2017