pith. machine review for the scientific record.

arxiv: 2605.10044 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Adaptive Action Chunking via Multi-Chunk Q Value Estimation

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords adaptive action chunking · reinforcement learning · Q-value estimation · transformer architecture · offline-to-online RL · action sequences · imitation learning · dynamic chunk length

The pith

ACH lets RL agents dynamically choose action sequence lengths by estimating values for all candidates in one Transformer forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Adaptive Action Chunking as an offline-to-online reinforcement learning algorithm that removes the need for a fixed action sequence length. Instead of committing to one chunk size for an entire task, the method uses a Transformer to compute Q-values for every possible chunk length at the current state in a single forward pass. The agent then picks the length that yields the highest estimated value. This adaptation happens during both training and inference. On 34 challenging tasks the approach yields higher returns and faster learning than any fixed-length baseline.

Core claim

ACH dynamically modulates chunk length by simultaneously estimating action-values for all candidate chunk lengths in a single forward pass of a Transformer-based architecture, allowing the agent to select the most effective chunk length adaptively based on the current state.
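
As a reading aid, one plausible formalization of this claim is sketched below; the notation (candidate set \(\mathcal{H}\), critic \(Q_\theta\), policy \(\pi_\phi\)) is the reviewer's assumption, not the paper's own.

```latex
% Reviewer's sketch, not the paper's notation: a chunk-level Q-function evaluated at
% every candidate length h, with the executed length chosen by maximizing over that set.
\begin{align*}
  Q_\theta\!\left(s, a_{1:h}\right) &\approx
    \mathbb{E}\!\left[\sum_{t=0}^{h-1} \gamma^{t} r_t + \gamma^{h} V(s_h) \,\middle|\, s_0 = s\right],
    \qquad h \in \mathcal{H} = \{1, \dots, H_{\max}\},\\
  h^{*}(s) &= \arg\max_{h \in \mathcal{H}} Q_\theta\!\left(s, a_{1:h}\right),
    \qquad a_{1:H_{\max}} \sim \pi_\phi(\cdot \mid s),
\end{align*}
% with all Q_\theta(s, a_{1:h}) produced by a single forward pass of the shared Transformer.
```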

What carries the argument

Multi-chunk Q-value estimation, in which one Transformer forward pass produces separate Q-value estimates for every candidate chunk length so the policy can choose the best length per state.
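
A minimal sketch of how such a single-pass estimator could be wired up, assuming a PyTorch-style causal Transformer; the class, layer sizes, and variable names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a causal Transformer reads [state, a_1, ..., a_H] and the
# hidden state at position h is decoded into Q(s, a_{1:h}), so Q-values for every
# candidate chunk length come out of one forward pass.
import torch
import torch.nn as nn

class MultiChunkQ(nn.Module):
    def __init__(self, state_dim, action_dim, max_len, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.state_in = nn.Linear(state_dim, d_model)
        self.action_in = nn.Linear(action_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.q_head = nn.Linear(d_model, 1)

    def forward(self, state, actions):
        # state: (B, state_dim); actions: (B, H, action_dim), a candidate chunk of max length H.
        tokens = torch.cat([self.state_in(state).unsqueeze(1), self.action_in(actions)], dim=1)
        tokens = tokens + self.pos[:, : tokens.size(1)]
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        hidden = self.encoder(tokens, mask=causal)            # one forward pass
        # Position h attends only to the state and a_1..a_h, so its readout can act as Q(s, a_{1:h}).
        return self.q_head(hidden[:, 1:, :]).squeeze(-1)      # (B, H): one Q-value per chunk length

# q = critic(state, chunk)              # (B, H)
# chosen_len = q.argmax(dim=-1) + 1     # pick the length with the highest estimated value
```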

If this is right

  • Agents can improve behavioral consistency and reduce bootstrapping errors by adapting chunk length to the current state.
  • The same architecture supports both offline pre-training and online fine-tuning without extra training signals.
  • Performance improves on 34 diverse tasks, indicating better generalization across environments with varying optimal horizons.
  • Training and inference cost remain comparable to a single fixed-length model because all estimates share one forward pass.


Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could lower the cost of hyperparameter search by removing the need to tune chunk length in advance.
  • The single-pass multi-scale estimation pattern may transfer to other sequence-decision problems such as option discovery or hierarchical RL.
  • In real-time control settings the adaptive choice might reduce unnecessary commitment to long plans when the environment changes rapidly.

Load-bearing premise

A single Transformer forward pass can produce accurate and non-interfering Q-value estimates for every candidate chunk length at once.

What would settle it

An ablation that replaces the shared Transformer head with independent Q-heads for each chunk length and measures whether the performance gap over fixed-length baselines disappears.
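
For concreteness, one way such a control could look, with the rest of the training loop held fixed; this is a hypothetical sketch of the ablation, not an experiment reported in the paper.

```python
# Hypothetical control: independent critics, one per candidate chunk length, with no
# shared trunk. If the shared single-pass estimator is doing real work, the advantage
# over fixed-length baselines should shrink or vanish under this swap.
import torch
import torch.nn as nn

class PerLengthQHeads(nn.Module):
    def __init__(self, state_dim, action_dim, max_len, hidden=256):
        super().__init__()
        self.critics = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim * h, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for h in range(1, max_len + 1)
        ])

    def forward(self, state, actions):
        # state: (B, state_dim); actions: (B, max_len, action_dim)
        qs = [critic(torch.cat([state, actions[:, :h].flatten(1)], dim=-1))
              for h, critic in enumerate(self.critics, start=1)]
        return torch.cat(qs, dim=-1)  # (B, max_len), same interface as the shared estimator
```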

Figures

Figures reproduced from arXiv: 2605.10044 by Jongeui Park, Jongseong Chae, Seongmin Kim, Yongjae Shin, Youngchul Sung.

Figure 1: To achieve this, we leverage a Transformer-based value function capable of simultaneously …
Figure 2: Action value as a function of chunk length …
Figure 3: The overall architecture of ACH: (left) causal Transformer-based joint value learning and …
Figure 4: Learning curves over 1M online steps across robomimic tasks. The performance is averaged …
Figure 5: Behavioral comparison of ACH and fixed-length baselines in the …
Figure 6: Experimental comparison of ACH against alternative training and sampling strategies. Each plot presents the aggregated performance across five distinct tasks.
Figure 7: Evolution of average chunk length during the online learning phase on puzzle-4x4.
Figure 8: Comparison of computational costs between the proposed method and baselines on the …
Figure 9: Behavioral comparison of ACH and fixed-length baselines in the …
Figure 10: Learning curves during online learning for all tasks in OGBench. Solid lines represent the …
read the original abstract

Action chunking emerged as a pivotal technique in imitation learning, enabling policies to predict cohesive action sequences rather than single actions. Recently, this approach has expanded to reinforcement learning (RL), enhancing behavioral consistency and reducing bootstrapping errors in value function estimation. However, existing methods rely on a fixed chunk length, creating a performance bottleneck as the optimal length varies across states and tasks. In this paper, we propose Adaptive Action CHunking (ACH), a novel offline-to-online RL algorithm that dynamically modulates chunk length during both training and inference. To find the optimal chunk length for a dynamically varying current state, we simultaneously estimate action-values for all candidate chunk lengths in a single forward pass, using a Transformer-based architecture. Our mechanism allows the agent to select the most effective chunk length adaptively based on the current state. Evaluated on 34 challenging tasks, ACH consistently outperforms fixed-length baselines, demonstrating superior generalization and learning efficiency in complex environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Adaptive Action Chunking (ACH), an offline-to-online RL algorithm that dynamically selects action chunk length at each state by estimating Q-values for all candidate lengths simultaneously via a single Transformer forward pass and choosing the length with the highest Q-value. It claims this adaptive mechanism yields consistent outperformance over fixed-length chunking baselines on 34 challenging tasks, with gains in generalization and learning efficiency.

Significance. If the multi-chunk Q-estimation mechanism proves reliable, the approach could meaningfully advance action chunking in RL by removing the fixed-length bottleneck that limits prior methods. The single-pass architecture offers computational efficiency, and the scale of evaluation (34 tasks) provides a reasonable testbed for generalization claims.

major comments (2)
  1. [Method (multi-chunk Q estimation) and Experiments] The core technical claim (abstract and method section) rests on the Transformer producing accurate, comparable, and non-interfering Q-value estimates for multiple chunk lengths in one forward pass. No ablations, separate-head controls, auxiliary disentanglement losses, or per-length calibration experiments are described to isolate whether representational interference occurs or whether selection is driven by the adaptive mechanism rather than other architectural factors.
  2. [Experiments and Results] The experimental results (abstract) report consistent outperformance on 34 tasks but provide no details on run-to-run variance, statistical significance tests, or the exact composition of the fixed-length baselines (e.g., which lengths were tested and how they were chosen). This makes it difficult to assess whether the reported gains are robust or sensitive to hyperparameter choices.
minor comments (2)
  1. [Abstract] The abstract would benefit from explicitly stating the set of candidate chunk lengths considered and the precise Transformer architecture (number of layers, attention heads, output heads) to aid reproducibility.
  2. [Method] Notation for the multi-chunk Q-function and the selection rule should be formalized with equations in the method section rather than left at a high-level description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the multi-chunk Q estimation and experimental details.

read point-by-point responses
  1. Referee: [Method (multi-chunk Q estimation) and Experiments] The core technical claim (abstract and method section) rests on the Transformer producing accurate, comparable, and non-interfering Q-value estimates for multiple chunk lengths in one forward pass. No ablations, separate-head controls, auxiliary disentanglement losses, or per-length calibration experiments are described to isolate whether representational interference occurs or whether selection is driven by the adaptive mechanism rather than other architectural factors.

    Authors: We agree that additional controls would better isolate the contribution of the joint estimation mechanism. In the revised manuscript we will add an ablation comparing the shared Transformer against a multi-head architecture with separate Q-heads per chunk length, along with an analysis of Q-value correlations across lengths to check for interference. These results will clarify whether adaptive selection is driven by the proposed mechanism. revision: yes

  2. Referee: [Experiments and Results] The experimental results (abstract) report consistent outperformance on 34 tasks but provide no details on run-to-run variance, statistical significance tests, or the exact composition of the fixed-length baselines (e.g., which lengths were tested and how they were chosen). This makes it difficult to assess whether the reported gains are robust or sensitive to hyperparameter choices.

    Authors: We acknowledge the need for greater experimental transparency. The revised version will report run-to-run variance with standard deviations across multiple seeds, include statistical significance tests comparing ACH to the fixed-length baselines, and explicitly describe the baseline chunk lengths tested together with the selection criteria used. revision: yes
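
To make the second commitment concrete, the sketch below shows the kind of seed-level robustness reporting being promised: per-seed means with spread, a Welch's t-test, and a bootstrap interval on the mean difference. The return values are placeholders, not results from the paper.

```python
# Illustrative analysis only: per-seed final returns for ACH and one fixed-length
# baseline (placeholder numbers), mean +/- std, Welch's t-test, and a bootstrap 95% CI.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ach_returns = np.array([0.82, 0.79, 0.85, 0.80, 0.84])    # one value per seed (placeholder)
fixed_returns = np.array([0.71, 0.76, 0.69, 0.74, 0.72])  # fixed-length baseline (placeholder)

print(f"ACH   : {ach_returns.mean():.3f} +/- {ach_returns.std(ddof=1):.3f}")
print(f"fixed : {fixed_returns.mean():.3f} +/- {fixed_returns.std(ddof=1):.3f}")

t, p = stats.ttest_ind(ach_returns, fixed_returns, equal_var=False)  # Welch's t-test
print(f"Welch t = {t:.2f}, p = {p:.4f}")

# Bootstrap 95% CI on the difference of mean returns.
diffs = [rng.choice(ach_returns, ach_returns.size).mean()
         - rng.choice(fixed_returns, fixed_returns.size).mean()
         for _ in range(10_000)]
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"bootstrap 95% CI on mean difference: [{lo:.3f}, {hi:.3f}]")
```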

Circularity Check

0 steps flagged

No circularity: new algorithmic proposal with no self-referential derivations or fitted predictions.

full rationale

The paper introduces ACH as a novel offline-to-online RL algorithm that uses a Transformer to simultaneously estimate Q-values for multiple chunk lengths in one forward pass, then selects the best length adaptively. No equations, derivations, or parameter-fitting steps are described that reduce to prior quantities by construction. The method is presented as an empirical algorithm evaluated on 34 tasks, with no load-bearing self-citations, uniqueness theorems, or ansatzes imported from the authors' prior work. The central claim (adaptive chunking via multi-chunk Q estimation) is independent of its own outputs and does not rename known results or call fitted inputs predictions. This is a standard non-circular algorithmic contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; standard RL assumptions (Markov property, value-function approximation) are implicitly used but not stated.

pith-pipeline@v0.9.0 · 5471 in / 1129 out tokens · 39778 ms · 2026-05-12T02:24:48.153441+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 9 internal anchors

  1. [1]

    Michael Albergo and Eric Vanden-Eijnden. 2023. Building Normalizing Flows with Stochastic Interpolants. In The Eleventh International Conference on Learning Representations

  2. [2]

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. 2023. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning. PMLR, 1577–1594

  3. [3]

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. 2025. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  4. [4]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 2024. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164 (2024)

  5. [5]

    Kevin Black, Manuel Y Galliker, and Sergey Levine. 2025. Real-time execution of action chunking flow policies. URL https://arxiv.org/abs/2506.07339

  6. [6]

    Jongseong Chae, Jongeui Park, Yongjae Shin, Gyeongmin Kim, Seungyul Han, and Youngchul Sung. [n. d.]. Flow Actor-Critic for Offline Reinforcement Learning. In The Fourteenth International Conference on Learning Representations

  7. [7]

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. 2025. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44, 10-11 (2025), 1684–1704

  8. [8]

    Perry Dong, Qiyang Li, Dorsa Sadigh, and Chelsea Finn. 2025. Expo: Stable reinforcement learning with expressive policies. arXiv preprint arXiv:2507.07986 (2025)

  9. [9]

    Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. 2023. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573 (2023)

  10. [10]

    Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415 (2016)

  11. [11]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851

  12. [12]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 2025. π0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054 (2025)

  13. [13]

    Changyeon Kim, Haeone Lee, Younggyo Seo, Kimin Lee, and Yuke Zhu. 2026. DEAS: DEtached value learning with Action Sequence for Scalable Offline RL. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=bVTaAXeBmE

  14. [14]

    Moo Jin Kim, Chelsea Finn, and Percy Liang. 2025. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)

  15. [15]

    Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings

  16. [16]

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. 2022. Offline Reinforcement Learning with Implicit Q-Learning. In International Conference on Learning Representations

  17. [17]

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems 33 (2020), 1179–1191

  18. [18]

    Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. 2022. Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Conference on Robot Learning. PMLR, 1702–1712

  19. [19]

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020)

  20. [20]

    Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, and Gerhard Neumann. 2025. TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning. In The Thirteenth International Conference on Learning Representations

  21. [21]

    Qiyang Li, Seohong Park, and Sergey Levine. 2026. Decoupled Q-Chunking. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=aqGNdZQL9l

  22. [22]

    Qiyang Li, Zhiyuan Zhou, and Sergey Levine. 2025. Reinforcement Learning with Action Chunking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  23. [23]

    Yuanchang Liang, Shikai Li, Shiji Song, and Gao Huang. 2026. Adaptive Action Chunking at Inference-time for Vision-Language-Action Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv preprint arXiv:2604.04161

  24. [24]

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. 2023. Flow Matching for Generative Modeling. In The Eleventh International Conference on Learning Representations

  25. [25]

    Qin-Wen Luo, Ming-Kun Xie, Yewen Wang, and Sheng-Jun Huang. 2024. Optimistic critic reconstruction and constrained fine-tuning for general offline-to-online RL. Advances in Neural Information Processing Systems 37 (2024), 108167–108207

  26. [26]

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. 2021. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. In 5th Annual Conference on Robot Learning

  27. [27]

    CF Nagy, Onur Celik, Emiliyan Gospodinov, Florian Seligmann, Weiran Liao, Aryan Kaushik, and Gerhard Neumann. 2026. SEAR: Sample Efficient Action Chunking Reinforcement Learning. arXiv preprint arXiv:2603.01891 (2026)

  28. [28]

    Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. 2023. Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems 36 (2023), 62244–62269

  29. [29]

    Kwanyoung Park, Seohong Park, Youngwoon Lee, and Sergey Levine. 2026. Scalable Offline Model-Based RL with Action Chunks. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=WXGb9unEHo

  30. [30]

    Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. 2025. OGBench: Benchmarking Offline Goal-Conditioned RL. In The Thirteenth International Conference on Learning Representations

  31. [31]

    Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. 2025. Horizon Reduction Makes RL Scalable. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  32. [32]

    Seohong Park, Qiyang Li, and Sergey Levine. 2025. Flow Q-Learning. In International Conference on Machine Learning (ICML)

  33. [33]

    Kohei Sendai, Maxime Alvarez, Tatsuya Matsushima, Yutaka Matsuo, and Yusuke Iwasawa

  34. [34]

    Leave no observation behind: Real-time correction for VLA action chunks. arXiv preprint arXiv:2509.23224 (2025)

  35. [35]

    Younggyo Seo and Pieter Abbeel. 2025. Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Reinforcement Learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=VoFXUNc9Zh

  36. [36]

    Yongjae Shin, Jongseong Chae, Jongeui Park, and Youngchul Sung. 2026. Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=6wd38R8L0Z

  37. [37]

    Yongjae Shin, Jeonghye Kim, Whiyoung Jung, Sunghoon Hong, Deunsol Yoon, Youngsoo Jang, Geon-Hyeong Kim, Jongseong Chae, Youngchul Sung, Kanghoon Lee, and Woohyung Lim. 2025. Online Pre-Training for Offline-to-Online Reinforcement Learning. In Forty-second International Conference on Machine Learning

  38. [38]

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning. PMLR, 2256–2265

  39. [39]

    Gwanwoo Song, Kwanyoung Park, and Youngwoon Lee. 2026. Chunk-Guided Q-Learning. arXiv preprint arXiv:2603.13971 (2026)

  40. [40]

    Richard S Sutton, Andrew G Barto, et al. 1998. Reinforcement learning: An introduction. Vol. 1. MIT Press, Cambridge

  41. [41]

    Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. 2023. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems 36 (2023), 11592–11620

  42. [42]

    Dong Tian, Onur Celik, and Gerhard Neumann. 2026. Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=rb5eTktqbc

  43. [43]

    Shenzhi Wang, Qisen Yang, Jiawei Gao, Matthieu Lin, Hao Chen, Liwei Wu, Ning Jia, Shiji Song, and Gao Huang. 2023. Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning. Advances in Neural Information Processing Systems 36 (2023), 47081–47104

  44. [44]

    Jialong Wu, Haixu Wu, Zihan Qiu, Jianmin Wang, and Mingsheng Long. 2022. Supported policy optimization for offline reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 31278–31291

  45. [45]

    Jiarui Yang, Bin Zhu, Jingjing Chen, and Yu-Gang Jiang. 2026. Actor-critic for continuous action chunks: A reinforcement learning framework for long-horizon robotic manipulation with sparse reward. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 18692–18700

  46. [46]

    Zishun Yu and Xinhua Zhang. 2023. Actor-critic alignment for offline-to-online reinforcement learning. In International Conference on Machine Learning. PMLR, 40452–40474

  47. [47]

    Haichao Zhang, Wei Xu, and Haonan Yu. 2023. Policy Expansion for Bridging Offline-to-Online Reinforcement Learning. In The Eleventh International Conference on Learning Representations

  48. [48]

    Yinmin Zhang, Jie Liu, Chuming Li, Yazhe Niu, Yaodong Yang, Yu Liu, and Wanli Ouyang

  49. [49]

    A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 16908–16916

  50. [50]

    Tony Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. 2023. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. Robotics: Science and Systems XIX (2023)

  51. [51]

    Zhiyuan Zhou, Andy Peng, Qiyang Li, Sergey Levine, and Aviral Kumar. 2024. Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data. In The Thirteenth International Conference on Learning Representations