pith. sign in

arxiv: 2605.30928 · v1 · pith:CCHQBM7Xnew · submitted 2026-05-29 · 💻 cs.RO

Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization

Pith reviewed 2026-06-28 22:27 UTC · model grok-4.3

classification 💻 cs.RO
keywords reinforcement learninghuman-like agentsvector quantizationmacro actionsD4RLhierarchical learning
0
0 comments X

The pith

Two levels of vector quantization turn human action sequences into macro actions that let RL agents match human behavior more closely while keeping high task success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HiMAQ, which applies two successive vector quantizations to human demonstrations so that lower-level clusters capture fine subactions and higher-level clusters group them into reusable macro actions. Reinforcement learning agents then learn policies over these macro actions instead of raw actions, producing trajectories that score higher on human-likeness metrics. On D4RL benchmark tasks the hierarchical version beats the flat MAQ baseline on human-likeness while matching or exceeding success rates, and the gains hold when the same structure is plugged into IQL, SAC, or RLPD.

Core claim

Encoding human demonstrations into macro actions via two successive levels of vector quantization enables RL agents to generate behaviors that align more closely with humans while still maximizing rewards, and this hierarchical approach improves human-likeness scores over single-level quantization without reducing task performance.

What carries the argument

HiMAQ, a two-level hierarchical macro action quantization in which the lower level maps actions to fine-grained subaction clusters and the higher level aggregates those clusters into macro action clusters.

If this is right

  • Agents using the two-level structure outperform the non-hierarchical MAQ baseline on human-likeness while keeping comparable or better success rates.
  • The same hierarchical quantization integrates successfully with IQL, SAC, and RLPD and delivers the human-likeness gains across all three.
  • The resulting policies remain competitive with earlier RL agents on the same D4RL tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the macro actions capture reusable human movement patterns they could transfer to new tasks without retraining the quantization layers.
  • Adding a third quantization level might further improve alignment when human behavior contains distinct sub-phases within a macro action.
  • The same two-level structure could be tested on real robot hardware to check whether the human-likeness gains reduce the need for post-hoc safety filters.

Load-bearing premise

The human-likeness scores computed on D4RL data actually measure behavioral similarity to humans and the two-level structure is what produces the measured gains rather than other implementation choices.

What would settle it

Retraining the identical agents on the same data after collapsing the two quantization levels into one while keeping every other detail fixed and finding no drop in human-likeness scores would falsify the claim that the hierarchy itself is responsible.

Figures

Figures reproduced from arXiv: 2605.30928 by Ali Shah Ali, Fawad Javed Fateh, M. Shaheer Luqman, Murad Popattia, M. Zeeshan Zia, Quoc-Huy Tran, Usman Nizamani.

Figure 1
Figure 1. Figure 1: Our HiMAQ framework includes a two-stage training process: [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Impacts of different macro action lengths [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Turing test win rates (i.e., fraction of trials where evaluators mistook an agent for a human). [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Human-likeness ranking test pairwise win rates. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Human-like agents are a long-standing goal of artificial intelligence. Despite strong performance, most reinforcement learning (RL) agents remain reward-driven and often exhibit behaviors that differ from humans, limiting interpretability and reliability. In this work, we introduce a novel human-like RL framework that predicts action sequences closely aligned with human behaviors while maximizing rewards. Specifically, we encode human demonstrations into macro actions using a hierarchical macro action quantization approach (termed HiMAQ) consisting of two successive levels of vector quantization. The lower quantization level maps input actions to fine-grained subaction clusters, while the higher quantization level aggregates these subaction clusters into action clusters. Extensive evaluations on the D4RL benchmarks show that our hierarchical approach outperforms the non-hierarchical baseline (MAQ), achieving better human-likeness scores while maintaining comparable or better success rates than previous RL agents. The improvements generalize across integrations with various RL algorithms, namely IQL, SAC, and RLPD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HiMAQ, a two-level hierarchical vector quantization method (subaction clusters followed by action clusters) to encode human demonstrations into macro actions for RL agents. It claims this yields better human-likeness scores than the non-hierarchical MAQ baseline on D4RL benchmarks while preserving or improving success rates, with the gains generalizing when the macro actions are integrated into IQL, SAC, and RLPD.

Significance. If the hierarchy is shown to be the causal factor and the human-likeness metric is validated as capturing behavioral similarity, the framework could offer a practical route to more interpretable RL policies that align with human behavior on offline benchmarks.

major comments (3)
  1. [Abstract and Experiments] Abstract and §4 (Experiments): the human-likeness metric is never defined, nor is any validation provided that it measures behavioral similarity to humans rather than a simple distributional distance; without this, the central outperformance claim cannot be assessed.
  2. [Method and Ablations] §3 (Method) and §4.2 (Ablations): no ablation isolates the two successive VQ levels from capacity or training differences; a single-level VQ with matched total codebook size could produce equivalent scores, so the attribution of gains to hierarchy is unsupported.
  3. [Integration and Experiments] §3.3 (Integration) and §4.3: the paper states that HiMAQ generalizes across IQL/SAC/RLPD but supplies no description of how the quantized macro actions are actually inserted into each algorithm's policy or replay buffer, leaving open whether observed differences arise from the hierarchy or from unstated implementation choices.
minor comments (2)
  1. [Abstract] Abstract: numerical human-likeness and success-rate values, together with standard errors or statistical tests, are omitted.
  2. [Method] Notation in §3: the distinction between the lower-level subaction codebook and the higher-level action codebook is introduced without an explicit equation relating the two quantization steps.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will make the indicated revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and §4 (Experiments): the human-likeness metric is never defined, nor is any validation provided that it measures behavioral similarity to humans rather than a simple distributional distance; without this, the central outperformance claim cannot be assessed.

    Authors: We agree that the human-likeness metric requires an explicit definition and validation within the main text. In the revised manuscript we will add a dedicated subsection (new §3.4) that formally defines the metric (including its computation from demonstration trajectories) and provide supporting validation experiments that compare it against direct behavioral similarity measures such as trajectory overlap and human preference ratings. These additions will be placed before the experimental results to ensure the central claims can be properly assessed. revision: yes

  2. Referee: [Method and Ablations] §3 (Method) and §4.2 (Ablations): no ablation isolates the two successive VQ levels from capacity or training differences; a single-level VQ with matched total codebook size could produce equivalent scores, so the attribution of gains to hierarchy is unsupported.

    Authors: We acknowledge the need for an ablation that controls for total codebook capacity. We will add a new experiment in §4.2 that compares HiMAQ against a single-level vector quantization baseline whose codebook size equals the product of the two HiMAQ levels. This will isolate whether the observed gains stem from the hierarchical structure rather than increased representational capacity. revision: yes

  3. Referee: [Integration and Experiments] §3.3 (Integration) and §4.3: the paper states that HiMAQ generalizes across IQL/SAC/RLPD but supplies no description of how the quantized macro actions are actually inserted into each algorithm's policy or replay buffer, leaving open whether observed differences arise from the hierarchy or from unstated implementation choices.

    Authors: We will expand §3.3 with explicit pseudocode and textual descriptions of the integration procedure for each algorithm. The revised section will detail how macro-action sequences are encoded into the policy input, how they are sampled during training, and how they are stored and retrieved from the replay buffer for IQL, SAC, and RLPD respectively. This will eliminate ambiguity regarding implementation choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison with no derivational reduction

full rationale

The paper introduces HiMAQ as a two-level vector quantization method on human demonstrations and reports empirical outperformance versus the non-hierarchical MAQ baseline on D4RL success rates and human-likeness scores, with generalization across IQL/SAC/RLPD. No equations, derivations, or first-principles claims appear; the central result is a set of benchmark numbers whose validity rests on external data and controls rather than any quantity defined inside the paper being renamed or fitted as a prediction. Self-citation of MAQ (if present) is not load-bearing for a mathematical result because the evaluation is comparative and falsifiable on public benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, no parameter counts, and no explicit modeling assumptions. Free parameters, axioms, and invented entities cannot be extracted.

pith-pipeline@v0.9.1-grok · 5728 in / 1353 out tokens · 25217 ms · 2026-06-28T22:27:46.399992+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1]

    Schrittwieser, I

    J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lock- hart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering Atari, Go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, Dec. 2020

  2. [2]

    Vinyals, I

    O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V . Dalibard, D. Budden, Y . Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff,...

  3. [3]

    Berner, G

    OpenAI, C. Berner, G. Brockman, B. Chan, V . Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. d. O. Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang. Dota 2 with Large Scale Deep Reinforcement Learni...

  4. [4]

    Hafner, J

    D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse control tasks through world models.Nature, pages 1–7, Apr. 2025

  5. [5]

    J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning, Feb. 2021

  6. [6]

    P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. Efficient Online Reinforcement Learning with Offline Data. InProceedings of the 40th International Conference on Machine Learning, pages 1577–1594. PMLR, July 2023

  7. [7]

    Mysore, B

    S. Mysore, B. Mabsout, R. Mancuso, and K. Saenko. Regularizing Action Policies for Smooth Control with Reinforcement Learning. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1810–1816. IEEE Press, May 2021

  8. [8]

    Lee, H.-G

    I. Lee, H.-G. Cao, C.-T. Dao, Y .-C. Chen, and I.-C. Wu. Gradient-based Regularization for Action Smoothness in Robotic Control with Reinforcement Learning. In2024 IEEE/RSJ In- ternational Conference on Intelligent Robots and Systems (IROS), pages 603–610, Oct. 2024

  9. [9]

    Q. Shen, Y . Li, H. Jiang, Z. Wang, and T. Zhao. Deep Reinforcement Learning with Robust and Smooth Policy. InProceedings of the 37th International Conference on Machine Learning, pages 8707–8718. PMLR, Nov. 2020

  10. [10]

    W. Koch. Flight Controller Synthesis Via Deep Reinforcement Learning, Sept. 2019

  11. [11]

    Milani, A

    S. Milani, A. Juliani, I. Momennejad, R. Georgescu, J. Rzepecki, A. Shaw, G. Costello, F. Fang, S. Devlin, and K. Hofmann. Navigates Like Me: Understanding How People Evaluate Human- Like AI in Video Games. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, pages 1–18. Association for Computing Machinery, Apr. 2023

  12. [12]

    Devlin, R

    S. Devlin, R. Georgescu, I. Momennejad, J. Rzepecki, E. Zuniga, G. Costello, G. Leroy, A. Shaw, and K. Hofmann. Navigation Turing Test (NTT): Learning to Evaluate Human- Like Navigation. InProceedings of the 38th International Conference on Machine Learning, pages 2644–2653. PMLR, July 2021. 10

  13. [13]

    Zuniga, S

    E. Zuniga, S. Milani, G. Leroy, J. Rzepecki, R. Georgescu, I. Momennejad, D. Bignell, M. Sun, A. Shaw, G. Costello, M. Jacob, S. Devlin, and K. Hofmann. How Humans Perceive Human- like Behavior in Video Game Navigation. InExtended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, CHI EA ’22, pages 1–11. Association for Comput- in...

  14. [14]

    Ho, P.-C

    K.-H. Ho, P.-C. Hsieh, C.-C. Lin, Y .-R. Lou, F.-J. Wang, and I.-C. Wu. Towards Human- Like RL: Taming Non-Naturalistic Behavior in Deep RL via Adaptive Behavioral Costs in 3D Games. InProceedings of the 15th Asian Conference on Machine Learning, pages 438–453. PMLR, Feb. 2024

  15. [15]

    Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955, 2022

    N. Hansen, X. Wang, and H. Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022

  16. [16]

    Guo, Y .-C

    J.-T. Guo, Y .-C. Chen, P.-C. Hsieh, K.-H. Ho, P.-W. Huang, T.-R. Wu, I. Wu, et al. Learning human-like rl agents through trajectory optimization with action quantization.Advances in Neural Information Processing Systems, 38:83534–83565, 2026

  17. [17]

    Van Den Oord, O

    A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

  18. [18]

    Spurio, E

    F. Spurio, E. Bahrami, G. Francesca, and J. Gall. Hierarchical vector quantization for unsuper- vised action segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6996–7005, 2025

  19. [19]

    Kostrikov, A

    I. Kostrikov, A. Nair, and S. Levine. Offline Reinforcement Learning with Implicit Q-Learning. InInternational Conference on Learning Representations, Oct. 2021

  20. [20]

    Haarnoja, A

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. InProceedings of the 35th International Conference on Machine Learning, pages 1861–1870. PMLR, July 2018

  21. [21]

    Fujii, Y

    N. Fujii, Y . Sato, H. Wakama, K. Kazai, and H. Katayose. Evaluating human-like behaviors of video-game agents autonomously acquired with biological constraints. InInternational Conference on Advances in Computer Entertainment Technology, pages 61–76. Springer, 2013

  22. [22]

    Ho, P.-C

    K.-H. Ho, P.-C. Hsieh, C.-C. Lin, Y .-R. Lou, F.-J. Wang, and I.-C. Wu. Towards human-like rl: Taming non-naturalistic behavior in deep rl via adaptive behavioral costs in 3d games. In Asian Conference on Machine Learning, pages 438–453. PMLR, 2024

  23. [23]

    Hester, M

    T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep q-learning from demonstrations. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  24. [24]

    End to End Learning for Self-Driving Cars

    M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316, 2016

  25. [25]

    M. Zare, P. M. Kebria, A. Khosravi, and S. Nahavandi. A survey of imitation learning: Al- gorithms, recent developments, and challenges.IEEE Transactions on Cybernetics, 54(12): 7173–7186, 2024

  26. [26]

    Arora and P

    S. Arora and P. Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress.Artificial Intelligence, 297:103500, 2021

  27. [27]

    I. P. Durugkar, C. Rosenbaum, S. Dernbach, and S. Mahadevan. Deep reinforcement learning with macro-actions.arXiv preprint arXiv:1606.04615, 2016

  28. [28]

    D. P. Kingma and M. Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013. 11

  29. [29]

    Ozair, Y

    S. Ozair, Y . Li, A. Razavi, I. Antonoglou, A. Van Den Oord, and O. Vinyals. Vector quantized models for planning. Ininternational conference on machine learning, pages 8302–8313. PMLR, 2021

  30. [30]

    Antonoglou, J

    I. Antonoglou, J. Schrittwieser, S. Ozair, T. K. Hubert, and D. Silver. Planning in stochastic environments with a learned model. InInternational Conference on Learning Representations, 2021

  31. [31]

    J. Luo, P. Dong, J. Wu, A. Kumar, X. Geng, and S. Levine. Action-quantized offline reinforce- ment learning for robotic skill learning. InConference on Robot Learning, pages 1348–1361. PMLR, 2023

  32. [32]

    A. D. Vuong, M. N. Vu, D. An, and I. Reid. Action tokenizer matters in in-context imita- tion learning. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13490–13496. IEEE, 2025

  33. [33]

    C. G. Atkeson and S. Schaal. Robot Learning From Demonstration. InProceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pages 12–20. Morgan Kaufmann Publishers Inc., July 1997

  34. [34]

    H. Zhou, T. Wei, Z. Lin, j. li, J. Xing, Y . Shi, L. Shen, C. Yu, and D. Ye. Revisiting Discrete Soft Actor-Critic, Nov. 2024

  35. [35]

    Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

    A. Rajeswaran, V . Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstra- tions.arXiv preprint arXiv:1709.10087, 2017. 12