
arxiv: 2606.11408 · v1 · pith:LJHWDI4Tnew · submitted 2026-06-09 · 💻 cs.RO

Dynamic Execution Horizon Prediction for Chunk-based Robot Policies

Pith reviewed 2026-06-27 12:46 UTC · model grok-4.3

classification 💻 cs.RO
keywords action chunking · robot manipulation · dynamic horizon prediction · reinforcement learning · policy adaptation · open-loop execution · fine-grained tasks

The pith

Dynamic Execution Horizon Prediction adapts chunk execution lengths on frozen policies to raise success on precise robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that fixed execution horizons in action-chunking robot policies limit performance on fine-grained tasks because each chunk is executed open-loop. It introduces Dynamic Execution Horizon Prediction, a lightweight branch trained with online reinforcement learning while the base chunk policy remains completely frozen. This branch learns to output variable horizons that shorten during precision stages and lengthen during free-space motion. A sympathetic reader would care because the method is reported to improve task success without retraining or inspecting the original policy, offering a modular way to add reactivity to existing chunk-based systems.

Core claim

Dynamic Execution Horizon Prediction (DEHP) trains a lightweight execution-horizon prediction branch using online reinforcement learning while keeping the pretrained chunk policy completely frozen. This makes the method compatible with black-box chunk policies and isolates the effect of adapting the execution horizon from changes to the underlying action generator. DEHP predicts shorter execution horizons during fine-grained stages of the task and longer horizons during free-space motion, balancing the efficiency of open-loop chunk execution with the reactivity of closed-loop single-step control and improving success rates on high-precision and long-horizon manipulation tasks.

What carries the argument

Lightweight execution-horizon prediction branch trained with online RL on a frozen pretrained chunk policy.
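
To make the mechanism concrete, here is a minimal sketch of a variable-horizon execution loop. It is an illustration only, not the paper's implementation: every name (`rollout`, `chunk_policy`, `horizon_branch`, `step`) is hypothetical, the toy 1-D environment is invented, and the assumption that the branch sees only the current observation is ours.

```python
from typing import Callable, List, Sequence

# Hypothetical types and names; none of them come from the paper.
Observation = float
Action = float


def rollout(
    chunk_policy: Callable[[Observation], Sequence[Action]],  # frozen black box
    horizon_branch: Callable[[Observation], int],             # small learned head
    step: Callable[[Observation, Action], Observation],       # environment transition
    obs: Observation,
    max_steps: int,
) -> List[int]:
    """Run one episode and return the horizon executed for each chunk."""
    horizons: List[int] = []
    t = 0
    while t < max_steps:
        chunk = chunk_policy(obs)                      # full chunk of proposed actions
        h = horizon_branch(obs)                        # how many to run open-loop
        h = max(1, min(h, len(chunk), max_steps - t))  # clamp to a valid range
        for action in chunk[:h]:                       # no replanning inside the chunk
            obs = step(obs, action)
        horizons.append(h)
        t += h                                         # next chunk starts at t + h
    return horizons


# Toy usage: move a 1-D position toward 0 and replan every step once close.
def toy_policy(x: float) -> List[float]:
    return [-0.1 * x] * 8                  # eight identical proportional corrections


def toy_branch(x: float) -> int:
    return 1 if abs(x) < 1.0 else 8        # short near the goal, long far away


def toy_step(x: float, a: float) -> float:
    return x + a


executed = rollout(toy_policy, toy_branch, toy_step, obs=4.0, max_steps=12)
print(executed)  # [8, 1, 1, 1, 1]
```

The chunk policy is only ever called, never modified, which is what makes the black-box setting possible in this sketch; the only learned quantity is the integer returned by `horizon_branch`.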

If this is right

  • DEHP applies to any black-box chunk policy without internal changes or retraining.
  • The predictor selects shorter horizons for fine manipulation and longer ones for free-space motion.
  • Success rates rise substantially across the tested high-precision and long-horizon tasks.
  • The separation of horizon prediction from action generation keeps the base policy unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lightweight-branch idea could be tested on other policy outputs such as termination signals or uncertainty estimates.
  • Task-specific horizon tuning might become unnecessary if the RL branch generalizes across related manipulation skills.
  • Adding the branch after policy training could serve as a low-cost way to retrofit older chunk policies for more reactive use.

Load-bearing premise

A lightweight horizon-prediction branch trained with online RL on a completely frozen pretrained chunk policy can reliably learn task-stage-appropriate horizons without access to or modification of the base policy internals.

What would settle it

Running the same evaluations on high-precision and long-horizon tasks and finding either no rise in success rate or no systematic shortening of horizons during fine-grained stages would falsify the central claim.
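
The second half of that test, checking for systematic shortening, could be run along the following lines. This is a sketch under stated assumptions: the stage labels (`"fine"`, `"free_space"`) would have to come from manual or scripted annotation of rollouts, and the log below is placeholder data invented for illustration, not a result from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_horizon_by_stage(log: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Average the predicted horizon separately for each annotated task stage."""
    buckets: Dict[str, List[int]] = defaultdict(list)
    for stage, horizon in log:
        buckets[stage].append(horizon)
    return {stage: mean(values) for stage, values in buckets.items()}


# Placeholder (stage, predicted horizon) pairs; made up, not measured.
placeholder_log = [
    ("free_space", 8), ("free_space", 6), ("fine", 2),
    ("fine", 1), ("fine", 3), ("free_space", 7),
]

stage_means = mean_horizon_by_stage(placeholder_log)
shortens_in_fine_stages = stage_means["fine"] < stage_means["free_space"]
print(stage_means, shortens_in_fine_stages)
```

A real check would also need enough episodes per stage for the difference in means to be distinguishable from noise.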

Original abstract

Action chunking has become a standard design in modern robot policies, from diffusion/flow policies to vision-language-action models, where the policy predicts a sequence of actions and executes a fixed number of them instead of acting one step at a time. However, this paradigm relies on a key assumption: a fixed execution horizon. During chunk execution, the policy operates open-loop, which is particularly problematic for fine-grained manipulation tasks that require frequent replanning. In practice, the execution horizon is typically chosen through empirical tuning and is highly task-dependent. To this end, we propose Dynamic Execution Horizon Prediction (DEHP), an effective method that trains a lightweight execution-horizon prediction branch using online reinforcement learning while keeping the pretrained chunk policy completely frozen. This makes the method compatible with black-box chunk policies and isolates the effect of adapting the execution horizon from changes to the underlying action generator. Across our evaluations, DEHP improves the success rate of different high-precision and long-horizon manipulation tasks by a large margin. Our qualitative analysis further shows that DEHP predicts shorter execution horizons during fine-grained stages of the task and longer horizons during free-space motion. In this way, DEHP balances the efficiency of open-loop chunk execution with the reactivity of closed-loop single-step control. Project page: https://dehp-chunking.github.io/
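
The efficiency-versus-reactivity trade-off in the abstract can be made concrete with a short count of policy queries. The symbols are illustrative assumptions: $T$ is the episode length in control steps, $h$ a fixed execution horizon, $H$ the chunk length, and $K$ the number of chunks executed; the values $T = 120$ and $H = 8$ are made up for the example.

```latex
% Number of policy queries with a fixed horizon h over T control steps.
\[
N_{\text{fixed}}(h) = \left\lceil \frac{T}{h} \right\rceil,
\qquad
N_{\text{fixed}}(1) = 120,
\qquad
N_{\text{fixed}}(8) = \left\lceil \frac{120}{8} \right\rceil = 15 .
\]
% With per-chunk horizons h_k in {1, ..., H} that sum to T, the number of
% queries K lies between the two fixed-horizon extremes.
\[
\sum_{k=0}^{K-1} h_k = T
\quad\Longrightarrow\quad
\left\lceil \frac{T}{H} \right\rceil \le K \le T .
\]
```

Closed-loop single-step control sits at the upper bound and fully open-loop chunk execution at the lower bound; a dynamic horizon lets the number of queries fall anywhere in between, spending replanning where it is needed.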

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Dynamic Execution Horizon Prediction (DEHP), which adds a lightweight horizon-prediction branch trained via online RL to a frozen pretrained chunk-based robot policy. The central claim is that this yields large-margin success-rate gains on high-precision and long-horizon manipulation tasks by dynamically selecting shorter execution horizons during fine-grained stages and longer horizons during free-space motion, thereby balancing open-loop chunk efficiency with closed-loop reactivity.

Significance. If the experimental results hold, the approach would be a practical contribution for adapting black-box chunk policies (diffusion, flow, or VLA models) without internal modification or retraining. The isolation of the horizon branch and the reported stage-dependent behavior address a real limitation of fixed-horizon chunking in contact-rich tasks.

major comments (2)
  1. [Abstract] Abstract: the claim that DEHP 'improves the success rate ... by a large margin' is presented without any quantitative results, baselines, trial counts, statistical tests, or ablation studies. This is load-bearing for the central claim and prevents assessment of whether gains are attributable to dynamic horizons rather than other factors.
  2. [Method] Method / Training description: the horizon branch is trained solely with downstream task reward while the chunk policy remains completely frozen and inaccessible. Given that manipulation rewards are typically sparse and delayed, it is unclear how the branch can reliably discover the fine-grained vs. free-space distinction asserted in the qualitative analysis; no auxiliary losses, feature access, or shaping are described that would supply the necessary signal.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a concise statement of the specific tasks, robot platforms, and base policies used in the evaluations.
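
As an illustration of the kind of reporting the first major comment asks for, a success rate can be accompanied by its trial count and a confidence interval. The sketch below uses the Wilson score interval; the function name and the trial counts are hypothetical placeholders and are not taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


# Placeholder counts, invented for illustration only.
fixed_lo, fixed_hi = wilson_interval(successes=12, trials=50)
dynamic_lo, dynamic_hi = wilson_interval(successes=31, trials=50)
print(f"fixed horizon:   [{fixed_lo:.3f}, {fixed_hi:.3f}]")
print(f"dynamic horizon: [{dynamic_lo:.3f}, {dynamic_hi:.3f}]")
```

Intervals like these make it visible whether a "large margin" survives the sampling noise of a few dozen trials per task.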

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that DEHP 'improves the success rate ... by a large margin' is presented without any quantitative results, baselines, trial counts, statistical tests, or ablation studies. This is load-bearing for the central claim and prevents assessment of whether gains are attributable to dynamic horizons rather than other factors.

    Authors: We agree that the abstract lacks the quantitative details needed to support the central claim. In the revised manuscript we will incorporate specific success-rate deltas, trial counts, baseline comparisons, and references to the statistical tests and ablations already present in the experimental section. revision: yes

  2. Referee: [Method] Method / Training description: the horizon branch is trained solely with downstream task reward while the chunk policy remains completely frozen and inaccessible. Given that manipulation rewards are typically sparse and delayed, it is unclear how the branch can reliably discover the fine-grained vs. free-space distinction asserted in the qualitative analysis; no auxiliary losses, feature access, or shaping are described that would supply the necessary signal.

    Authors: The horizon branch is trained exclusively with the downstream task reward while the chunk policy stays frozen, exactly as described. The fine-grained versus free-space behavior emerges from the online RL objective: horizon choices that improve task completion receive higher return, and the qualitative results confirm the learned policy exhibits the desired stage-dependent pattern. We will expand the method section with additional discussion of the reward signal and training dynamics to clarify this point. revision: partial
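
The disagreement in the second exchange is about credit assignment from a sparse terminal reward. The toy below is a sketch of how a stage-dependent horizon preference can emerge from a binary end-of-episode success signal alone. It rests on strong assumptions that are ours, not the paper's: a two-stage task, two horizon choices, made-up per-stage success probabilities, and a REINFORCE-style update standing in for whatever online RL algorithm the authors actually use.

```python
import math
import random
from typing import Dict, List

HORIZONS = [1, 8]  # hypothetical choices: replan every step vs. run 8 steps open-loop

# Made-up probabilities that a stage goes well for each horizon (illustration only).
P_STAGE_OK = {
    "free_space": {1: 0.6, 8: 0.9},
    "fine": {1: 0.9, 8: 0.3},
}


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(episodes: int = 3000, lr: float = 0.1, seed: int = 0) -> Dict[str, List[float]]:
    """REINFORCE with a running baseline; reward is 1 only if every stage succeeds."""
    rng = random.Random(seed)
    logits = {stage: [0.0] * len(HORIZONS) for stage in P_STAGE_OK}
    baseline = 0.0
    for _ in range(episodes):
        picks = {}
        success = True
        for stage in P_STAGE_OK:
            probs = softmax(logits[stage])
            idx = rng.choices(range(len(HORIZONS)), weights=probs)[0]
            picks[stage] = (idx, probs)
            stage_ok = rng.random() < P_STAGE_OK[stage][HORIZONS[idx]]
            success = success and stage_ok
        reward = 1.0 if success else 0.0         # sparse, delayed signal
        advantage = reward - baseline
        for stage, (idx, probs) in picks.items():
            for i in range(len(HORIZONS)):
                # Gradient of log softmax probability of the chosen horizon.
                grad = (1.0 if i == idx else 0.0) - probs[i]
                logits[stage][i] += lr * advantage * grad
        baseline += 0.05 * (reward - baseline)   # slow-moving average of returns
    return {stage: softmax(values) for stage, values in logits.items()}


learned = train()
for stage, probs in learned.items():
    print(stage, {h: round(p, 3) for h, p in zip(HORIZONS, probs)})
```

In this toy the branch needs no auxiliary loss because each stage's choice changes the probability of the terminal reward. Whether the same holds on real manipulation tasks, with long episodes and many more decisions per episode, is exactly what the referee asks the authors to substantiate.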

Circularity Check

0 steps flagged

No circularity: empirical RL training on frozen policy with external task rewards

Full rationale

The paper introduces DEHP as a lightweight horizon-prediction branch trained via online RL on a completely frozen pretrained chunk policy. The central claim of success-rate gains rests on empirical evaluations across tasks rather than any mathematical derivation or self-referential definition. No equations are presented that equate a 'prediction' to a fitted input by construction, and no load-bearing self-citations or uniqueness theorems reduce the method to its own inputs. The training signal (downstream task reward) is external to the branch's output, satisfying the non-circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields minimal ledger entries; the approach relies on standard RL training assumptions and the domain premise that variable horizons are beneficial.

axioms (1)
  • domain assumption: Fixed execution horizons are suboptimal for fine-grained manipulation tasks requiring replanning
    Core motivation stated in the opening sentences of the abstract.

pith-pipeline@v0.9.1-grok · 5792 in / 1088 out tokens · 22710 ms · 2026-06-27T12:46:54.159087+00:00 · methodology


A Appendix

A.1 Return invariance

Let π be a chunking policy with execution horizons $h_k \in \{1, \dots, H\}$, and let the chunk start times be $t_0 = 0$ and $t_{k+1} = t_k + h_k$. With the within-chunk ...