Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization
Pith reviewed 2026-06-28 22:27 UTC · model grok-4.3
The pith
Two levels of vector quantization turn human action sequences into macro actions that let RL agents match human behavior more closely while keeping high task success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Encoding human demonstrations into macro actions via two successive levels of vector quantization enables RL agents to generate behaviors that align more closely with humans while still maximizing rewards, and this hierarchical approach improves human-likeness scores over single-level quantization without reducing task performance.
What carries the argument
HiMAQ, a two-level hierarchical macro action quantization in which the lower level maps actions to fine-grained subaction clusters and the higher level aggregates those clusters into macro action clusters.
If this is right
- Agents using the two-level structure outperform the non-hierarchical MAQ baseline on human-likeness while keeping comparable or better success rates.
- The same hierarchical quantization integrates successfully with IQL, SAC, and RLPD and delivers the human-likeness gains across all three.
- The resulting policies remain competitive with earlier RL agents on the same D4RL tasks.
Where Pith is reading between the lines
- If the macro actions capture reusable human movement patterns they could transfer to new tasks without retraining the quantization layers.
- Adding a third quantization level might further improve alignment when human behavior contains distinct sub-phases within a macro action.
- The same two-level structure could be tested on real robot hardware to check whether the human-likeness gains reduce the need for post-hoc safety filters.
Load-bearing premise
The human-likeness scores computed on D4RL data actually measure behavioral similarity to humans and the two-level structure is what produces the measured gains rather than other implementation choices.
What would settle it
Retraining the identical agents on the same data after collapsing the two quantization levels into one while keeping every other detail fixed and finding no drop in human-likeness scores would falsify the claim that the hierarchy itself is responsible.
Figures
read the original abstract
Human-like agents are a long-standing goal of artificial intelligence. Despite strong performance, most reinforcement learning (RL) agents remain reward-driven and often exhibit behaviors that differ from humans, limiting interpretability and reliability. In this work, we introduce a novel human-like RL framework that predicts action sequences closely aligned with human behaviors while maximizing rewards. Specifically, we encode human demonstrations into macro actions using a hierarchical macro action quantization approach (termed HiMAQ) consisting of two successive levels of vector quantization. The lower quantization level maps input actions to fine-grained subaction clusters, while the higher quantization level aggregates these subaction clusters into action clusters. Extensive evaluations on the D4RL benchmarks show that our hierarchical approach outperforms the non-hierarchical baseline (MAQ), achieving better human-likeness scores while maintaining comparable or better success rates than previous RL agents. The improvements generalize across integrations with various RL algorithms, namely IQL, SAC, and RLPD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HiMAQ, a two-level hierarchical vector quantization method (subaction clusters followed by action clusters) to encode human demonstrations into macro actions for RL agents. It claims this yields better human-likeness scores than the non-hierarchical MAQ baseline on D4RL benchmarks while preserving or improving success rates, with the gains generalizing when the macro actions are integrated into IQL, SAC, and RLPD.
Significance. If the hierarchy is shown to be the causal factor and the human-likeness metric is validated as capturing behavioral similarity, the framework could offer a practical route to more interpretable RL policies that align with human behavior on offline benchmarks.
major comments (3)
- [Abstract and Experiments] Abstract and §4 (Experiments): the human-likeness metric is never defined, nor is any validation provided that it measures behavioral similarity to humans rather than a simple distributional distance; without this, the central outperformance claim cannot be assessed.
- [Method and Ablations] §3 (Method) and §4.2 (Ablations): no ablation isolates the two successive VQ levels from capacity or training differences; a single-level VQ with matched total codebook size could produce equivalent scores, so the attribution of gains to hierarchy is unsupported.
- [Integration and Experiments] §3.3 (Integration) and §4.3: the paper states that HiMAQ generalizes across IQL/SAC/RLPD but supplies no description of how the quantized macro actions are actually inserted into each algorithm's policy or replay buffer, leaving open whether observed differences arise from the hierarchy or from unstated implementation choices.
minor comments (2)
- [Abstract] Abstract: numerical human-likeness and success-rate values, together with standard errors or statistical tests, are omitted.
- [Method] Notation in §3: the distinction between the lower-level subaction codebook and the higher-level action codebook is introduced without an explicit equation relating the two quantization steps.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will make the indicated revisions to improve clarity and rigor.
read point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and §4 (Experiments): the human-likeness metric is never defined, nor is any validation provided that it measures behavioral similarity to humans rather than a simple distributional distance; without this, the central outperformance claim cannot be assessed.
Authors: We agree that the human-likeness metric requires an explicit definition and validation within the main text. In the revised manuscript we will add a dedicated subsection (new §3.4) that formally defines the metric (including its computation from demonstration trajectories) and provide supporting validation experiments that compare it against direct behavioral similarity measures such as trajectory overlap and human preference ratings. These additions will be placed before the experimental results to ensure the central claims can be properly assessed. revision: yes
-
Referee: [Method and Ablations] §3 (Method) and §4.2 (Ablations): no ablation isolates the two successive VQ levels from capacity or training differences; a single-level VQ with matched total codebook size could produce equivalent scores, so the attribution of gains to hierarchy is unsupported.
Authors: We acknowledge the need for an ablation that controls for total codebook capacity. We will add a new experiment in §4.2 that compares HiMAQ against a single-level vector quantization baseline whose codebook size equals the product of the two HiMAQ levels. This will isolate whether the observed gains stem from the hierarchical structure rather than increased representational capacity. revision: yes
-
Referee: [Integration and Experiments] §3.3 (Integration) and §4.3: the paper states that HiMAQ generalizes across IQL/SAC/RLPD but supplies no description of how the quantized macro actions are actually inserted into each algorithm's policy or replay buffer, leaving open whether observed differences arise from the hierarchy or from unstated implementation choices.
Authors: We will expand §3.3 with explicit pseudocode and textual descriptions of the integration procedure for each algorithm. The revised section will detail how macro-action sequences are encoded into the policy input, how they are sampled during training, and how they are stored and retrieved from the replay buffer for IQL, SAC, and RLPD respectively. This will eliminate ambiguity regarding implementation choices. revision: yes
Circularity Check
No circularity: empirical benchmark comparison with no derivational reduction
full rationale
The paper introduces HiMAQ as a two-level vector quantization method on human demonstrations and reports empirical outperformance versus the non-hierarchical MAQ baseline on D4RL success rates and human-likeness scores, with generalization across IQL/SAC/RLPD. No equations, derivations, or first-principles claims appear; the central result is a set of benchmark numbers whose validity rests on external data and controls rather than any quantity defined inside the paper being renamed or fitted as a prediction. Self-citation of MAQ (if present) is not load-bearing for a mathematical result because the evaluation is comparative and falsifiable on public benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Schrittwieser, I
J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lock- hart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering Atari, Go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, Dec. 2020
2020
-
[2]
Vinyals, I
O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V . Dalibard, D. Budden, Y . Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff,...
2019
-
[3]
Berner, G
OpenAI, C. Berner, G. Brockman, B. Chan, V . Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. d. O. Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang. Dota 2 with Large Scale Deep Reinforcement Learni...
2019
-
[4]
Hafner, J
D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse control tasks through world models.Nature, pages 1–7, Apr. 2025
2025
-
[5]
J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning, Feb. 2021
2021
-
[6]
P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. Efficient Online Reinforcement Learning with Offline Data. InProceedings of the 40th International Conference on Machine Learning, pages 1577–1594. PMLR, July 2023
2023
-
[7]
Mysore, B
S. Mysore, B. Mabsout, R. Mancuso, and K. Saenko. Regularizing Action Policies for Smooth Control with Reinforcement Learning. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1810–1816. IEEE Press, May 2021
2021
-
[8]
Lee, H.-G
I. Lee, H.-G. Cao, C.-T. Dao, Y .-C. Chen, and I.-C. Wu. Gradient-based Regularization for Action Smoothness in Robotic Control with Reinforcement Learning. In2024 IEEE/RSJ In- ternational Conference on Intelligent Robots and Systems (IROS), pages 603–610, Oct. 2024
2024
-
[9]
Q. Shen, Y . Li, H. Jiang, Z. Wang, and T. Zhao. Deep Reinforcement Learning with Robust and Smooth Policy. InProceedings of the 37th International Conference on Machine Learning, pages 8707–8718. PMLR, Nov. 2020
2020
-
[10]
W. Koch. Flight Controller Synthesis Via Deep Reinforcement Learning, Sept. 2019
2019
-
[11]
Milani, A
S. Milani, A. Juliani, I. Momennejad, R. Georgescu, J. Rzepecki, A. Shaw, G. Costello, F. Fang, S. Devlin, and K. Hofmann. Navigates Like Me: Understanding How People Evaluate Human- Like AI in Video Games. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, pages 1–18. Association for Computing Machinery, Apr. 2023
2023
-
[12]
Devlin, R
S. Devlin, R. Georgescu, I. Momennejad, J. Rzepecki, E. Zuniga, G. Costello, G. Leroy, A. Shaw, and K. Hofmann. Navigation Turing Test (NTT): Learning to Evaluate Human- Like Navigation. InProceedings of the 38th International Conference on Machine Learning, pages 2644–2653. PMLR, July 2021. 10
2021
-
[13]
Zuniga, S
E. Zuniga, S. Milani, G. Leroy, J. Rzepecki, R. Georgescu, I. Momennejad, D. Bignell, M. Sun, A. Shaw, G. Costello, M. Jacob, S. Devlin, and K. Hofmann. How Humans Perceive Human- like Behavior in Video Game Navigation. InExtended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, CHI EA ’22, pages 1–11. Association for Comput- in...
2022
-
[14]
Ho, P.-C
K.-H. Ho, P.-C. Hsieh, C.-C. Lin, Y .-R. Lou, F.-J. Wang, and I.-C. Wu. Towards Human- Like RL: Taming Non-Naturalistic Behavior in Deep RL via Adaptive Behavioral Costs in 3D Games. InProceedings of the 15th Asian Conference on Machine Learning, pages 438–453. PMLR, Feb. 2024
2024
-
[15]
Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955, 2022
N. Hansen, X. Wang, and H. Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022
-
[16]
Guo, Y .-C
J.-T. Guo, Y .-C. Chen, P.-C. Hsieh, K.-H. Ho, P.-W. Huang, T.-R. Wu, I. Wu, et al. Learning human-like rl agents through trajectory optimization with action quantization.Advances in Neural Information Processing Systems, 38:83534–83565, 2026
2026
-
[17]
Van Den Oord, O
A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017
2017
-
[18]
Spurio, E
F. Spurio, E. Bahrami, G. Francesca, and J. Gall. Hierarchical vector quantization for unsuper- vised action segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6996–7005, 2025
2025
-
[19]
Kostrikov, A
I. Kostrikov, A. Nair, and S. Levine. Offline Reinforcement Learning with Implicit Q-Learning. InInternational Conference on Learning Representations, Oct. 2021
2021
-
[20]
Haarnoja, A
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. InProceedings of the 35th International Conference on Machine Learning, pages 1861–1870. PMLR, July 2018
2018
-
[21]
Fujii, Y
N. Fujii, Y . Sato, H. Wakama, K. Kazai, and H. Katayose. Evaluating human-like behaviors of video-game agents autonomously acquired with biological constraints. InInternational Conference on Advances in Computer Entertainment Technology, pages 61–76. Springer, 2013
2013
-
[22]
Ho, P.-C
K.-H. Ho, P.-C. Hsieh, C.-C. Lin, Y .-R. Lou, F.-J. Wang, and I.-C. Wu. Towards human-like rl: Taming non-naturalistic behavior in deep rl via adaptive behavioral costs in 3d games. In Asian Conference on Machine Learning, pages 438–453. PMLR, 2024
2024
-
[23]
Hester, M
T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep q-learning from demonstrations. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018
2018
-
[24]
End to End Learning for Self-Driving Cars
M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[25]
M. Zare, P. M. Kebria, A. Khosravi, and S. Nahavandi. A survey of imitation learning: Al- gorithms, recent developments, and challenges.IEEE Transactions on Cybernetics, 54(12): 7173–7186, 2024
2024
-
[26]
Arora and P
S. Arora and P. Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress.Artificial Intelligence, 297:103500, 2021
2021
-
[27]
I. P. Durugkar, C. Rosenbaum, S. Dernbach, and S. Mahadevan. Deep reinforcement learning with macro-actions.arXiv preprint arXiv:1606.04615, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[28]
D. P. Kingma and M. Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013. 11
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[29]
Ozair, Y
S. Ozair, Y . Li, A. Razavi, I. Antonoglou, A. Van Den Oord, and O. Vinyals. Vector quantized models for planning. Ininternational conference on machine learning, pages 8302–8313. PMLR, 2021
2021
-
[30]
Antonoglou, J
I. Antonoglou, J. Schrittwieser, S. Ozair, T. K. Hubert, and D. Silver. Planning in stochastic environments with a learned model. InInternational Conference on Learning Representations, 2021
2021
-
[31]
J. Luo, P. Dong, J. Wu, A. Kumar, X. Geng, and S. Levine. Action-quantized offline reinforce- ment learning for robotic skill learning. InConference on Robot Learning, pages 1348–1361. PMLR, 2023
2023
-
[32]
A. D. Vuong, M. N. Vu, D. An, and I. Reid. Action tokenizer matters in in-context imita- tion learning. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13490–13496. IEEE, 2025
2025
-
[33]
C. G. Atkeson and S. Schaal. Robot Learning From Demonstration. InProceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pages 12–20. Morgan Kaufmann Publishers Inc., July 1997
1997
-
[34]
H. Zhou, T. Wei, Z. Lin, j. li, J. Xing, Y . Shi, L. Shen, C. Yu, and D. Ye. Revisiting Discrete Soft Actor-Critic, Nov. 2024
2024
-
[35]
Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations
A. Rajeswaran, V . Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstra- tions.arXiv preprint arXiv:1709.10087, 2017. 12
work page internal anchor Pith review Pith/arXiv arXiv 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.