Pith · machine review for the scientific record

arxiv: 2506.07339 · v2 · submitted 2025-06-09 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Real-Time Execution of Action Chunking Flow Policies

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:13 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords real-time chunking · action chunking · vision-language-action models · diffusion policies · flow matching · robotics · inference latency · bimanual manipulation

The pith

Real-time chunking generates the next action chunk while executing the current one by freezing committed steps and inpainting the rest, letting any diffusion- or flow-based vision-language-action model run smoothly despite inference latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces real-time chunking (RTC), an inference-time algorithm that overlaps generation of the next action chunk with execution of the current one. It freezes the initial actions that are guaranteed to execute and inpaints the uncertain later portion of the chunk, removing the pauses and jerky transitions that normally appear at chunk boundaries when model inference is slow. The method requires no retraining and works with any existing diffusion- or flow-based VLA. Experiments on twelve simulated dynamic tasks and six real bimanual manipulation tasks show higher task throughput and preserved success rates on precise tasks such as lighting a match, even when inference delay is large.
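
The overlap-and-freeze loop can be sketched in a few lines. This is a schematic reconstruction, not the paper's implementation: `sample_chunk`, the chunk length, and the replanning cadence are invented stand-ins, and a real policy would condition on observations and run a full flow-model sampler.

```python
import threading
import queue

import numpy as np

CHUNK = 8    # actions per chunk (illustrative)
DELAY = 3    # inference latency, measured in control steps (illustrative)


def sample_chunk(frozen_prefix):
    """Stand-in for one flow-model inference call.

    A real policy would condition on observations and inpaint the steps
    after `frozen_prefix`; here we pad with zeros just to show the shapes.
    """
    rest = np.zeros((CHUNK - len(frozen_prefix), 2))
    return np.concatenate([frozen_prefix, rest], axis=0)


def rtc_loop(n_chunks=3):
    executed = []
    chunk = np.random.randn(CHUNK, 2)   # initial chunk, generated synchronously
    for _ in range(n_chunks):
        result = queue.Queue()
        # The first DELAY remaining actions are guaranteed to run before
        # inference finishes, so they are frozen into the next chunk.
        frozen = chunk[:DELAY]
        worker = threading.Thread(
            target=lambda: result.put(sample_chunk(frozen)))
        worker.start()                   # next chunk generates asynchronously
        for action in chunk[:DELAY]:     # meanwhile, execute committed steps
            executed.append(action)
        worker.join()                    # inference completes within DELAY
        chunk = result.get()[DELAY:]     # switch to the inpainted remainder
    return np.array(executed)
```

The point of the sketch is only the control flow: execution never stalls, because the steps consumed during inference were frozen into the chunk being generated.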

Core claim

RTC produces the next action chunk asynchronously by freezing actions that are guaranteed to execute and inpainting the remainder, thereby allowing any diffusion- or flow-based VLA to maintain temporal consistency and high success rates during high-frequency control despite inference latency.

What carries the argument

Real-time chunking (RTC), which overlaps next-chunk generation with current-chunk execution by freezing committed actions and inpainting the uncertain remainder of the chunk.

Load-bearing premise

The inpainting of uncertain future actions in the next chunk preserves task-relevant consistency and does not introduce errors that degrade performance on precise or dynamic tasks when inference delay is present.
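
One way to make this premise concrete is a toy inpainting sampler. The sketch below uses hard masking in the spirit of RePaint adapted to a rectified-flow integrator: at every Euler step, the frozen prefix is pinned to its known flow path while the velocity field fills in the free suffix. The paper's actual guidance scheme may differ, and `velocity` is a synthetic stand-in for the learned flow model.

```python
import numpy as np

H, D = 8, 2      # chunk length, action dimension (illustrative)
FROZEN = 3       # steps guaranteed to execute


def velocity(x, tau, target):
    # Stand-in for the learned flow model: the conditional rectified-flow
    # velocity that transports x toward `target` as tau goes 0 -> 1.
    return (target - x) / (1.0 - tau + 1e-8)


def inpaint_chunk(frozen_actions, model_target, steps=100):
    rng = np.random.default_rng(0)
    noise = rng.normal(size=(H, D))
    x = noise.copy()
    for k in range(steps):
        tau = k / steps
        x = x + velocity(x, tau, model_target) / steps   # Euler step
        # Hard mask: frozen rows follow the known actions' flow path,
        # x_tau = (1 - tau) * noise + tau * action.
        tau_next = (k + 1) / steps
        x[:FROZEN] = (1 - tau_next) * noise[:FROZEN] + tau_next * frozen_actions
    return x
```

At `tau = 1` the frozen rows land exactly on the committed actions, so the premise being audited is whether the free suffix produced this way stays dynamically consistent with them.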

What would settle it

Running RTC on a precise task such as lighting a match under measured inference delay and observing lower success rate or increased errors compared with synchronous chunk execution on the same hardware would falsify the central claim.
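
A back-of-envelope accounting shows what is at stake in such a test: a synchronous executor pays `delay` idle control steps per chunk, while RTC hides inference behind execution. The numbers below are illustrative, not drawn from the paper.

```python
def steps_executed(total_steps, chunk=8, delay=3, asynchronous=True):
    """Control steps spent acting (vs. idling) over `total_steps` wall-clock
    steps, with one inference costing `delay` steps per chunk."""
    if asynchronous:            # RTC: inference hidden behind execution
        return total_steps
    period = chunk + delay      # synchronous: act for `chunk`, idle for `delay`
    full, rem = divmod(total_steps, period)
    return full * chunk + min(rem, chunk)
```

Under these toy numbers a synchronous executor acts on only 80 of 110 wall-clock steps (about 73% throughput); the falsification test above asks whether RTC's recovered throughput comes at the cost of precision.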

read the original abstract

Modern AI systems, especially those interacting with the physical world, increasingly require real-time performance. However, the high latency of state-of-the-art generalist models, including recent vision-language action models (VLAs), poses a significant challenge. While action chunking has enabled temporal consistency in high-frequency control tasks, it does not fully address the latency problem, leading to pauses or out-of-distribution jerky movements at chunk boundaries. This paper presents a novel inference-time algorithm that enables smooth asynchronous execution of action chunking policies. Our method, real-time chunking (RTC), is applicable to any diffusion- or flow-based VLA out of the box with no re-training. It generates the next action chunk while executing the current one, "freezing" actions guaranteed to execute and "inpainting" the rest. To test RTC, we introduce a new benchmark of 12 highly dynamic tasks in the Kinetix simulator, as well as evaluate 6 challenging real-world bimanual manipulation tasks. Results demonstrate that RTC is fast, performant, and uniquely robust to inference delay, significantly improving task throughput and enabling high success rates in precise tasks – such as lighting a match – even in the presence of significant latency. See https://pi.website/research/real_time_chunking for videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Real-Time Chunking (RTC), an inference-time algorithm for asynchronous execution of action chunking policies in diffusion- or flow-based vision-language-action (VLA) models. RTC generates the next action chunk while the current chunk executes, by freezing actions guaranteed to execute and inpainting the unfrozen tail. The method is claimed to apply out-of-the-box to any such VLA without retraining. It is evaluated on a new 12-task benchmark of highly dynamic tasks in the Kinetix simulator plus 6 real-world bimanual manipulation tasks, with reported gains in task throughput and robustness to inference latency, including high success on precise tasks such as match lighting.

Significance. If the performance and robustness claims are substantiated with quantitative evidence, RTC would address a practical deployment barrier for high-latency generalist VLAs in real-time robotics, enabling smoother control and higher throughput under delay. The Kinetix benchmark could also become a useful community resource for evaluating dynamic manipulation under latency constraints.

major comments (3)
  1. [§3] §3 (RTC algorithm description): the central robustness claim rests on the inpainting step producing task-consistent continuations for the unfrozen tail when early actions are frozen. No analysis, conditioning tests, or ablations are provided to verify that the underlying flow model maintains dynamic consistency in high-precision tasks once latency forces execution of the inpainted actions; this is load-bearing for the 'uniquely robust' assertion.
  2. [Results] Results section and abstract: positive outcomes are stated for the 12-task Kinetix benchmark and 6 real-world tasks, yet no numerical success rates, throughput metrics, baseline comparisons, error bars, or ablation results on the freezing/inpainting components are reported. This absence prevents evaluation of the magnitude or reliability of the claimed improvements.
  3. [§4.1] §4.1 (benchmark description): the Kinetix simulator benchmark is introduced as containing 12 highly dynamic tasks, but task definitions, success metrics, latency simulation protocol, and exact evaluation setup are not detailed, limiting reproducibility of the robustness findings.
minor comments (2)
  1. The abstract renders an en-dash via an unusual unicode sequence; a standard en-dash or hyphen would improve typographic clarity.
  2. [Introduction] Positioning relative to prior action-chunking and asynchronous control literature is brief; adding 2-3 key citations would better contextualize the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate additional analysis, numerical results, and benchmark details as outlined.

read point-by-point responses
  1. Referee: [§3] §3 (RTC algorithm description): the central robustness claim rests on the inpainting step producing task-consistent continuations for the unfrozen tail when early actions are frozen. No analysis, conditioning tests, or ablations are provided to verify that the underlying flow model maintains dynamic consistency in high-precision tasks once latency forces execution of the inpainted actions; this is load-bearing for the 'uniquely robust' assertion.

    Authors: We acknowledge that the robustness claim would be strengthened by explicit verification of the inpainting step. Although the flow model is conditioned on the frozen prefix and generates trajectories consistent with the observed dynamics by design, we agree that dedicated analysis is warranted. In the revised manuscript, we will add conditioning tests and ablations on the inpainting component, including quantitative evaluation of dynamic consistency on high-precision tasks under forced execution of inpainted actions. revision: yes

  2. Referee: [Results] Results section and abstract: positive outcomes are stated for the 12-task Kinetix benchmark and 6 real-world tasks, yet no numerical success rates, throughput metrics, baseline comparisons, error bars, or ablation results on the freezing/inpainting components are reported. This absence prevents evaluation of the magnitude or reliability of the claimed improvements.

    Authors: We apologize for the lack of explicit numerical values in the prose. While the manuscript presents results via figures and tables, we will revise the Results section and abstract to directly report success rates, throughput metrics, baseline comparisons, error bars, and ablation results on the freezing/inpainting components, enabling clearer assessment of the improvements. revision: yes

  3. Referee: [§4.1] §4.1 (benchmark description): the Kinetix simulator benchmark is introduced as containing 12 highly dynamic tasks, but task definitions, success metrics, latency simulation protocol, and exact evaluation setup are not detailed, limiting reproducibility of the robustness findings.

    Authors: We agree that expanded details are required for reproducibility. In the revised manuscript, we will substantially expand §4.1 to include full task definitions, precise success metrics for each of the 12 tasks, the latency simulation protocol, and the complete evaluation setup (including trial counts, randomization, and execution details). revision: yes

Circularity Check

0 steps flagged

No circularity: RTC is a standalone inference-time procedure

full rationale

The paper presents RTC as an algorithmic procedure for asynchronous execution of action chunks in diffusion/flow VLAs. It generates the next chunk while executing the current one by freezing guaranteed actions and inpainting the rest, with no equations, fitted parameters, or derivations shown. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The method is tested empirically on benchmarks rather than derived from self-referential inputs. The central claim of out-of-the-box applicability rests on the model's existing conditioning properties, which are external to the paper and not redefined circularly.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard properties of diffusion and flow models plus the existing action chunking framework; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5537 in / 1090 out tokens · 69336 ms · 2026-05-15T14:13:39.045070+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

    cs.RO 2026-05 unverdicted novelty 7.0

    A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

  2. RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

    cs.RO 2026-05 unverdicted novelty 7.0

    RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

  3. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 7.0

    Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

  4. DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

    cs.RO 2026-04 unverdicted novelty 7.0

    Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...

  5. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  6. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  7. Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

    cs.RO 2026-03 conditional novelty 7.0

    GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

  8. AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

  9. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  10. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  11. Adaptive Action Chunking via Multi-Chunk Q Value Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.

  12. MotuBrain: An Advanced World Action Model for Robot Control

    cs.RO 2026-04 unverdicted novelty 6.0

    MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...

  13. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  14. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  15. AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.

  16. Tube Diffusion Policy: Reactive Visual-Tactile Policy Learning for Contact-rich Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    Tube Diffusion Policy learns observation-conditioned feedback flows around nominal action chunks to enable fast reactive control in visual-tactile contact-rich manipulation.

  17. Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP

    cs.SD 2026-04 unverdicted novelty 6.0

    A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.

  18. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  19. SERNF: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

    cs.RO 2026-02 unverdicted novelty 6.0

    SERNF achieves sample-efficient real-world fine-tuning of multimodal dexterous policies by pairing exact-likelihood normalizing flow policies with action-chunked value critics.

  20. Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    cs.RO 2025-10 unverdicted novelty 6.0

    A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.

  21. Understanding Asynchronous Inference Methods for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Controlled benchmarks show per-step residual correction (A2C2) as most effective for VLA asynchronous inference up to d=8 delays on Kinetix with over 90% solve rate, outperforming inpainting and conditioning while tra...

  22. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

  23. Position: Embodied AI Requires a Privacy-Utility Trade-off

    cs.AI 2026-05 unverdicted novelty 4.0

    Embodied AI requires treating privacy as a lifecycle architectural constraint rather than a stage-local feature, addressed via the proposed SPINE framework with a multi-criterion privacy classification matrix.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 21 Pith papers · 27 internal anchors

  1. [1]

    Is Conditional Generative Modeling all you need for Decision-Making?

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making?arXiv preprint arXiv:2211.15657, 2022

  2. [2]

    Automatic differentiation in machine learning: a survey

    Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey.Journal of machine learning research, 18(153):1–43, 2018

  3. [3]

    Minivla: A better vla with a smaller footprint, 2024

    Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://github.com/Stanford-ILIAD/openvla-mini

  4. [4]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  6. [6]

    Riemannian flow matching policy for robot motion learning

    Max Braun, Noémie Jaquier, Leonel Rozo, and Tamim Asfour. Riemannian flow matching policy for robot motion learning. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5144–5151. IEEE, 2024

  7. [8]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  8. [9]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

  9. [10]

    NaVILA: Legged Robot Vision-Language-Action Model for Navigation

    An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged Robot Vision-Language-Action Model for Navigation.arXiv preprint arXiv:2412.04453, 2024

  10. [11]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

  11. [12]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots.arXiv preprint arXiv:2402.10329, 2024

  12. [13]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    OX-Embodiment Collaboration, A Padalkar, A Pooley, A Jain, A Bewley, A Herzog, A Irpan, A Khazatsky, A Rai, A Singh, et al. Open X-Embodiment: Robotic learning datasets and RT-X models.arXiv preprint arXiv:2310.08864, 1(2), 2023

  13. [14]

    Process Systems Analysis and Control

    Donald R. Coughanowr and Steven E. LeBlanc. Process Systems Analysis and Control, chapter 18. McGraw-Hill, New York, 3rd edition, 2009. ISBN 978-0073397894

  14. [15]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 653–660. IEEE, 2024

  15. [16]

    At Human Speed: Deep Reinforcement Learning with Action Delay

    Vlad Firoiu, Tina Ju, and Josh Tenenbaum. At human speed: Deep reinforcement learning with action delay.arXiv preprint arXiv:1810.07286, 2018

  16. [17]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

  17. [18]

    Diffusion meets flow matching: Two sides of the same coin

    Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin P. Murphy, and Tim Salimans. Diffusion meets flow matching: Two sides of the same coin. 2024. URL https://diffusionflow.github.io/

  18. [19]

    One act play: Single demonstration behavior cloning with action chunking transformers

    Abraham George and Amir Barati Farimani. One act play: Single demonstration behavior cloning with action chunking transformers. arXiv preprint arXiv:2309.10175, 2023

  19. [20]

    Google’s gemini has beaten pokémon blue (with a little help)

    Anthony Ha. Google’s gemini has beaten pokémon blue (with a little help). https://techcrunch.com/2025/05/03/googles-gemini-has-beaten-pokemon-blue-with-a-little-help/, May 2025. Accessed May 8, 2025

  20. [21]

    Temporal difference learning for model predictive control

    Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control, 2022. URL https://arxiv.org/abs/2203.04955

  21. [22]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  22. [23]

    Streaming diffusion policy: Fast policy synthesis with variable noise diffusion models

    Sigmund H Høeg, Yilun Du, and Olav Egeland. Streaming diffusion policy: Fast policy synthesis with variable noise diffusion models.arXiv preprint arXiv:2406.04806, 2024

  23. [24]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  24. [25]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018

  25. [26]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis.arXiv preprint arXiv:2205.09991, 2022

  26. [27]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning.arXiv preprint arXiv:2410.24185, 2024

  27. [28]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  28. [29]

    DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

  29. [30]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  30. [31]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025

  31. [32]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  32. [33]

    Action chunking as policy compression

    Lucy Lai, Ann Zixiang Huang, and Samuel J Gershman. Action chunking as policy compression. 2022

  33. [34]

    Behavior generation with latent actions

    Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions.arXiv preprint arXiv:2403.03181, 2024

  34. [35]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of Machine Learning and Systems, 6:87–100, 2024

  35. [36]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  36. [37]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  37. [38]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  38. [39]

    Bidirectional decoding: Improving action chunking via closed-loop resampling

    Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, and Chelsea Finn. Bidirectional decoding: Improving action chunking via closed-loop resampling.arXiv preprint arXiv:2408.17355, 2024

  39. [40]

    Repaint: Inpainting using denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471, 2022

  40. [41]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. InConference on Robot Learning, pages 879–893. PMLR, 2018

  41. [42]

    A variational perspective on solving inverse problems with diffusion models

    Morteza Mardani, Jiaming Song, Jan Kautz, and Arash Vahdat. A variational perspective on solving inverse problems with diffusion models.arXiv preprint arXiv:2305.04391, 2023

  42. [43]

    Kinetix: Investigating the training of general agents through open-ended physics-based control tasks

    Michael Matthews, Michael Beukman, Chris Lu, and Jakob Foerster. Kinetix: Investigating the training of general agents through open-ended physics-based control tasks.arXiv preprint arXiv:2410.23208, 2024

  43. [44]

    Quest: Self- supervised skill abstractions for learning continuous control, 2024

    Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self- supervised skill abstractions for learning continuous control, 2024. URL https://arxiv. org/abs/2407.15840

  44. [45]

    Introducing openai codex, August 2021

    OpenAI. Introducing openai codex, August 2021. URL https://openai.com/index/ introducing-codex/. Accessed on May 27, 2025

  45. [46]

    Imitating human behaviour with diffusion models

    Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models.arXiv preprint arXiv:2301.10677, 2023

  46. [47]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025. 13

  47. [48]

    Training-free linear image inverses via flows.arXiv preprint arXiv:2310.04432, 2023

    Ashwini Pokle, Matthew J Muckley, Ricky TQ Chen, and Brian Karrer. Training-free linear image inverses via flows.arXiv preprint arXiv:2310.04432, 2023

  48. [49]

    Consistency policy: Accelerated visuomotor policies via consistency distillation

    Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation.arXiv preprint arXiv:2405.07503, 2024

  49. [50]

    Robust policy optimization in deep reinforcement learning.arXiv preprint arXiv:2212.07536, 2022

    Md Masudur Rahman and Yexiang Xue. Robust policy optimization in deep reinforcement learning.arXiv preprint arXiv:2212.07536, 2022

  50. [51]

    Rawlings, D.Q

    J.B. Rawlings, D.Q. Mayne, and M. Diehl.Model Predictive Control: Theory, Computation, and Design. Nob Hill Publishing, 2017. ISBN 9780975937730. URL https://books.google. ch/books?id=MrJctAEACAAJ

  51. [52]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

  52. [53]

    Real-time neural mpc: Deep learning model predictive control for quadrotors and agile robotic platforms.IEEE Robotics and Automation Letters, 8(4):2397–2404, 2023

    Tim Salzmann, Elia Kaufmann, Jon Arrizabalaga, Marco Pavone, Davide Scaramuzza, and Markus Ryll. Real-time neural mpc: Deep learning model predictive control for quadrotors and agile robotic platforms.IEEE Robotics and Automation Letters, 8(4):2397–2404, 2023

  53. [54]

    Control delay in rein- forcement learning for real-time dynamic systems: A memoryless approach

    Erik Schuitema, Lucian Bu¸ soniu, Robert Babuška, and Pieter Jonker. Control delay in rein- forcement learning for real-time dynamic systems: A memoryless approach. In2010 IEEE/RSJ international conference on intelligent robots and systems, pages 3226–3231. IEEE, 2010

  54. [55]

    Pseudoinverse-guided diffusion models for inverse problems

    Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. InInternational Conference on Learning Representations, 2023

  55. [56]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023

  56. [57]

    Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

    Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots.arXiv preprint arXiv:1804.10332, 2018

  57. [58]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser- rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

  58. [59]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  59. [60]

    Lingo-2: Driving with natural language, 2024

    Waywe Research Team et al. Lingo-2: Driving with natural language, 2024

  60. [61]

    Mlp-mixer: An all-mlp architecture for vision.Advances in neural information processing systems, 34: 24261–24272, 2021

    Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision.Advances in neural information processing systems, 34: 24261–24272, 2021

  61. [62]

    BridgeData v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen- Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

  62. [63]

    Planning and learning in environments with delayed feedback

    Thomas J Walsh, Ali Nouri, Lihong Li, and Michael L Littman. Planning and learning in environments with delayed feedback. InMachine Learning: ECML 2007: 18th European Conference on Machine Learning, Warsaw, Poland, September 17-21, 2007. Proceedings 18, pages 442–453. Springer, 2007

  63. [64]

    Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning.arXiv preprint arXiv:2208.06193, 2022

  64. [65]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025. 14

  65. [66]

    Thinking while moving: Deep reinforcement learning with concurrent control.arXiv preprint arXiv:2004.06089, 2020

    Ted Xiao, Eric Jang, Dmitry Kalashnikov, Sergey Levine, Julian Ibarz, Karol Hausman, and Alexander Herzog. Thinking while moving: Deep reinforcement learning with concurrent control.arXiv preprint arXiv:2004.06089, 2020

  66. [67]

    Real-time reinforcement learning optimized energy management for a 48v mild hybrid electric vehicle

    Bin Xu, Farzam Malmir, Dhruvang Rathod, and Zoran Filipi. Real-time reinforcement learning optimized energy management for a 48v mild hybrid electric vehicle. Technical report, SAE Technical Paper, 2019

  67. [68]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

  68. [69]

    Aloha unleashed: A simple recipe for robot dexterity.arXiv preprint arXiv:2410.13126, 2024

    Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity.arXiv preprint arXiv:2410.13126, 2024

  69. [70]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024

  70. [71]

    TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345, 2024. 15 NeurIPS Paper Checklist 1.Claims Question: Do the main claims made in the abstract and intro...

  71. [72]

    Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...