Real-Time Execution with Autoregressive Policies

Avi Caciularu; Hwasup Lim; Idan Szpektor; Sangkyu Lee; Seohyeon Park; Tackgeun You; Youngjae Yu

arxiv: 2606.13355 · v1 · pith:6E4N3UQ3new · submitted 2026-06-11 · 💻 cs.RO · cs.AI

Real-Time Execution with Autoregressive Policies

Sangkyu Lee , Seohyeon Park , Tackgeun You , Avi Caciularu , Idan Szpektor , Hwasup Lim , Youngjae Yu This is my paper

Pith reviewed 2026-06-27 06:38 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords autoregressive policiesreal-time executionvision-language-action modelsconstrained decodingmulti-trajectory decodingflow-matching policieslatency boundsrobot control

0 comments

The pith

Autoregressive policies achieve real-time execution by shortening tokenization horizons and applying constrained decoding to enforce latency bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that autoregressive policies for vision-language-action models can support real-time robot execution. Adjusting the horizon over which tokens are predicted and adding constraints during decoding keeps action generation within fixed time limits. This enables running multiple candidate trajectories in parallel and selecting the best one. A sympathetic reader would care because autoregressive models already show faster training convergence and stronger instruction following, so proving they can also meet strict speed requirements removes a major barrier to their use in physical robots. If correct, the result means these policies can handle both smooth trajectories and quick reactions in simulated and real environments without needing to switch to diffusion-style alternatives.

Core claim

We demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds from synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can rem

What carries the argument

Tokenization horizon adjustment combined with constrained decoding, which together enforce strict latency bounds and thereby permit multi-trajectory decoding.

If this is right

The autoregressive policy outperforms equivalent-level flow-matching policies across simulated and real-world tasks.
Task completion speeds increase significantly relative to synchronous inference.
Strict latency bounds become available, enabling multi-trajectory decoding to improve action selection.
Real-time execution remains compatible with the faster convergence and stronger instruction-following properties of autoregressive policies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same horizon and decoding adjustments could be tested on autoregressive policies for non-robotics sequence tasks that also require bounded response times.
Larger autoregressive models might now be deployed in latency-sensitive settings without additional hardware changes.
Hybrid training that mixes autoregressive and flow-matching objectives could be explored to combine their respective strengths under real-time constraints.

Load-bearing premise

Shortening the tokenization horizon and applying constrained decoding preserves policy performance and instruction-following ability without introducing new failure modes.

What would settle it

A direct comparison in which the horizon-adjusted autoregressive policy shows lower task success rates than the flow-matching policy when both are run under identical real-time latency constraints.

Figures

Figures reproduced from arXiv: 2606.13355 by Avi Caciularu, Hwasup Lim, Idan Szpektor, Sangkyu Lee, Seohyeon Park, Tackgeun You, Youngjae Yu.

**Figure 1.** Figure 1: Overview of Real-Time Execution with Autoregressive Policies. Given observation ot and action prefix AQ[0 : m], πθ asynchronously repeats multi-trajectory decoding and constrained decoding to guarantee detokenization and latency constraint dm ≤ m for avoiding pausing behavior. inference due to relatively high latency inherent in sequential decoding. Nevertheless, investigating real-time execution with auto… view at source ↗

**Figure 2.** Figure 2: Analysis of factors influencing the real-time execution with autoregressive policies. (a) Token length distribution of FAST [6] in the LIBERO [24] and DROID [25] datasets. (b) Performance changes of RTC [18] under dm ≤ m in the Kinetix [31] benchmark with 90% confidence intervals. such as FAST [6]. Paradoxically, real-time execution, which can accelerate the rollout speed, is more critical for autoregressi… view at source ↗

**Figure 3.** Figure 3: Ablation study in the LIBERO environment. (a) Average task success rate according to the training step. (b) Average task success rate according to τ of the model with 6k training steps. 4 Experiments 4.1 Experimental Setup Baselines We validate our approach in the single-arm robot manipulation setting, treating π0-FAST [6] as a representative autoregressive policy. We refer to the case where our approach i… view at source ↗

**Figure 4.** Figure 4: Evaluation results on zero-shot DROID deployment. We report 90% confidence intervals. Real-world Environment We use DROID [25] as the real-world environment for zero-shot tabletop manipulation in unseen scenes [6] and evaluate policies fine-tuned on the DROID dataset without additional demonstrations. We configure evaluation suites of 8 tabletop manipulation tasks, with 10 trials per task and a time limit … view at source ↗

**Figure 5.** Figure 5: Qualitative comparison between π0-FAST (top) and π0-REALFAST (bottom). We highlight events when object contact occurs in yellow, given instruction "place the cup on the plate". success to the maximum time limit, with 0 for failure to reflect the rollout speed. We observe that π0-REALFAST significantly surpasses the base model with synchronous inference, π0-FAST, while outperforming π0 and showing performan… view at source ↗

read the original abstract

Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds from synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical recipe for real-time autoregressive VLA policies via shorter token horizons and constrained decoding, with a direct comparison to flow-matching that looks worth checking in the full experiments.

read the letter

The core contribution is showing that autoregressive policies, which normally lag in synchronous rollout, can hit strict latency bounds by shortening the tokenization horizon and adding constrained decoding. This opens the door to multi-trajectory sampling at inference time without violating real-time constraints. The abstract frames this as addressing a gap that diffusion-focused work has left open for AR models.

What stands out is the direct head-to-head with an equivalent-level flow-matching policy across simulated and real environments, plus the reminder that AR policies already carry advantages in convergence speed and instruction following. If the numbers back the claim that the modified AR version still outperforms while running faster, that is a useful data point for anyone choosing between policy families.

The soft spot is exactly the one the stress-test flags: the abstract asserts that the two changes preserve policy quality and instruction adherence, but supplies no before-and-after metrics on success rate, trajectory quality, or new failure modes. Without those controls it is hard to know whether the latency win comes at an unstated cost. The full paper needs to show the ablation clearly.

This is aimed at robotics researchers who deploy large VLAs on hardware where both reactivity and task performance matter. It deserves a serious referee because the question is practical, the proposed fix is concrete, and the comparison is the right one to run, even if the current evidence level is still low.

Referee Report

2 major / 1 minor

Summary. The paper claims that autoregressive policies for Vision-Language-Action models can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding to guarantee strict latency bounds. This enables multi-trajectory decoding and yields consistent outperformance over equivalent-level flow-matching policies across simulated and real-world environments, along with significantly improved task completion speeds relative to synchronous inference. The work emphasizes the inherent advantages of autoregressive policies such as faster convergence and better instruction-following generalizability.

Significance. If the empirical claims are substantiated with quantitative evidence, the result would be significant for establishing autoregressive policies as viable for real-time robotic control, leveraging their training and generalization strengths over diffusion-based alternatives.

major comments (2)

[Abstract] Abstract: the claims that the autoregressive policy 'consistently outperforms' its flow-matching counterpart and achieves 'significantly improved task completion speeds' are presented without any quantitative metrics, success rates, latency measurements, statistical tests, or baseline details, preventing assessment of whether the data support the central claim.
[Methods/Results] The central claim requires that horizon adjustment plus constrained decoding simultaneously enforce latency bounds while leaving the policy distribution and instruction-following behavior essentially unchanged; however, no before/after comparison, ablation, or failure-mode analysis is supplied to substantiate that the modifications do not degrade performance or introduce new issues.

minor comments (1)

[Abstract] The abstract would be clearer if it named the specific simulated and real-world environments and the precise flow-matching baseline used for comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments. We address each major point below and will revise the manuscript to strengthen the presentation of our empirical results.

read point-by-point responses

Referee: [Abstract] Abstract: the claims that the autoregressive policy 'consistently outperforms' its flow-matching counterpart and achieves 'significantly improved task completion speeds' are presented without any quantitative metrics, success rates, latency measurements, statistical tests, or baseline details, preventing assessment of whether the data support the central claim.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we will incorporate key metrics (success rates, latency bounds, task completion times, and statistical comparisons) drawn from the experimental sections to substantiate the performance claims. revision: yes
Referee: [Methods/Results] The central claim requires that horizon adjustment plus constrained decoding simultaneously enforce latency bounds while leaving the policy distribution and instruction-following behavior essentially unchanged; however, no before/after comparison, ablation, or failure-mode analysis is supplied to substantiate that the modifications do not degrade performance or introduce new issues.

Authors: We acknowledge that explicit verification of distributional invariance is needed. The revised version will add before/after comparisons of policy outputs, ablations isolating the effect of horizon adjustment and constrained decoding on success rates and instruction adherence, and a brief failure-mode analysis to confirm that the latency guarantees are achieved without altering core behavior. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims only

full rationale

The paper presents no equations, derivations, fitted parameters, or mathematical claims that could reduce to their own inputs. All load-bearing assertions rest on empirical comparisons across simulated and real-world environments, with the central modifications (tokenization horizon adjustment and constrained decoding) described as engineering choices whose effects are measured directly rather than derived from prior self-referential steps. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the results. This is a standard non-circular empirical robotics paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.1-grok · 5707 in / 1049 out tokens · 19002 ms · 2026-06-27T06:38:09.567133+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 24 canonical work pages · 13 internal anchors

[1]

PaLM-E: An Embodied Multimodal Language Model

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

2023
[3]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fu- sai, M. Y . Galliker, et al.π0.5: a vision-language-action model with open-world generalization. In9th Annual Conference on Robot Learning, 2025

2025
[6]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Bjorck, F

NVIDIA, J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu,...

2025
[9]

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

GR-3 Technical Report

C. Cheang, S. Chen, Z. Cui, Y . Hu, L. Huang, T. Kong, H. Li, Y . Li, Y . Liu, X. Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

J. Cai, Z. Cai, J. Cao, Y . Chen, Z. He, L. Jiang, H. Li, H. Li, Y . Li, Y . Liu, et al. Internvla- a1: Unifying understanding, generation and action for robotic manipulation.arXiv preprint arXiv:2601.02456, 2026

work page arXiv 2026
[12]

Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y . Liu, D. Xiang, G. Wetzstein, and T.-Y . Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[13]

Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016. 9

work page doi:10.15607/rss.2023.xix.016 2023
[15]

Y . Liu, J. Hamid, A. Xie, Y . Lee, M. Du, and C. Finn. Bidirectional decoding: Improving action chunking via guided test-time sampling. InInternational Conference on Learning Representations, volume 2025, pages 4594–4627, 2025

2025
[16]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

T. Xiao, E. Jang, D. Kalashnikov, S. Levine, J. Ibarz, K. Hausman, and A. Herzog. Think- ing while moving: Deep reinforcement learning with concurrent control.arXiv preprint arXiv:2004.06089, 2020

work page arXiv 2004
[18]

Black, M

K. Black, M. Galliker, and S. Levine. Real-time execution of action chunking flow policies. Advances in Neural Information Processing Systems, 38:33383–33407, 2026

2026
[19]

Sendai, M

K. Sendai, M. Alvarez, T. Matsushima, Y . Matsuo, and Y . Iwasawa. Leave no observation behind: Real-time correction for vla action chunks.arXiv preprint arXiv:2509.23224, 2025

work page arXiv 2025
[20]

J. Tang, Y . Sun, Y . Zhao, S. Yang, Y . Lin, Z. Zhang, J. Hou, Y . Lu, Z. Liu, and S. Han. Vlash: Real-time vlas via future-state-aware asynchronous inference.arXiv preprint arXiv:2512.01031, 2025

work page arXiv 2025
[21]

Training-Time Action Conditioning for Efficient Real- Time Chunking,

K. Black, A. Z. Ren, M. Equi, and S. Levine. Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

work page arXiv 2025
[22]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[23]

S. H. Høeg, Y . Du, and O. Egeland. Streaming diffusion policy: Fast policy synthesis with variable noise diffusion models.arXiv preprint arXiv:2406.04806, 2024

work page arXiv 2024
[24]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36: 44776–44791, 2023

2023
[25]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

Y . Liu, S. Zhang, Z. Dong, B. Ye, T. Yuan, X. Yu, L. Yin, C. Lu, J. Shi, L. J.-T. Yu, et al. Faster: Toward efficient autoregressive vision language action modeling via neural action tokenization. arXiv preprint arXiv:2512.04952, 2025

work page arXiv 2025
[28]

H. Zhou, W. Liao, X. Huang, Y . Tang, F. Otto, X. Jia, X. Jiang, S. Hilber, G. Li, Q. Wang, et al. Beast: Efficient tokenization of b-splines encoded action sequences for imitation learning. Advances in Neural Information Processing Systems, 38:172934–172959, 2026

2026
[29]

C. Liu, X. Han, J. Gao, Y . Zhao, H. Chen, and Y . Du. Oat: Ordered action tokenization, 2026. URLhttps://arxiv.org/abs/2602.04215

work page arXiv 2026
[30]

E. W. Dijkstra. Cooperating sequential processes. In F. Genuys, editor,Programming Languages, pages 43–112. Academic Press, New York, 1968. 10

1968
[31]

Matthews, M

M. Matthews, M. Beukman, C. Lu, and J. Foerster. Kinetix: Investigating the training of general agents through open-ended physics-based control tasks. InInternational Conference on Learning Representations, volume 2025, pages 58515–58564, 2025

2025
[32]

Torne, K

M. Torne, K. Pertsch, H. Walke, K. Vedder, S. Nair, B. Ichter, A. Z. Ren, H. Wang, J. Tang, K. Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596, 2026

work page arXiv 2026
[33]

Hokamp and Q

C. Hokamp and Q. Liu. Lexically constrained decoding for sequence generation using grid beam search. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, 2017

2017
[34]

Post and D

M. Post and D. Vilar. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314–1324, 2018

2018
[35]

Deutsch, S

D. Deutsch, S. Upadhyay, and D. Roth. A general-purpose algorithm for constrained sequential inference. InProceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 482–492, 2019

2019
[36]

S. Jang, D. Kim, C. Kim, Y . Kim, and J. Shin. Verifier-free test-time sampling for vision language action models.arXiv preprint arXiv:2510.05681, 2025

work page arXiv 2025
[37]

R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean. Efficiently scaling transformer inference.Proceedings of machine learning and systems, 5:606–624, 2023

2023
[38]

PaliGemma: A versatile 3B VLM for transfer

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Driess, J

D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867–102888, 2026

2026
[40]

Leviathan, M

Y . Leviathan, M. Kalman, and Y . Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

2023
[41]

C. Wang, K. Cho, and J. Gu. Neural machine translation with byte-level subwords. In Proceedings of the AAAI conference on artificial intelligence, pages 9154–9160, 2020

2020
[42]

Sennrich, B

R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. InProceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), pages 1715–1725, 2016

2016
[43]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 11 A Implementation Details A.1 Constrained Decoding under FAST with Byte-level BPE Tokenization Throughout all experiments, we have implemented our approach using JAX based on theopenpi4 repository to adopt π0-FAST [6] as the base model for ourπ0-REA...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[1] [1]

PaLM-E: An Embodied Multimodal Language Model

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

2023

[3] [3]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fu- sai, M. Y . Galliker, et al.π0.5: a vision-language-action model with open-world generalization. In9th Annual Conference on Robot Learning, 2025

2025

[6] [6]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Bjorck, F

NVIDIA, J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu,...

2025

[9] [9]

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

GR-3 Technical Report

C. Cheang, S. Chen, Z. Cui, Y . Hu, L. Huang, T. Kong, H. Li, Y . Li, Y . Liu, X. Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

J. Cai, Z. Cai, J. Cao, Y . Chen, Z. He, L. Jiang, H. Li, H. Li, Y . Li, Y . Liu, et al. Internvla- a1: Unifying understanding, generation and action for robotic manipulation.arXiv preprint arXiv:2601.02456, 2026

work page arXiv 2026

[12] [12]

Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y . Liu, D. Xiang, G. Wetzstein, and T.-Y . Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[13] [13]

Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016. 9

work page doi:10.15607/rss.2023.xix.016 2023

[15] [15]

Y . Liu, J. Hamid, A. Xie, Y . Lee, M. Du, and C. Finn. Bidirectional decoding: Improving action chunking via guided test-time sampling. InInternational Conference on Learning Representations, volume 2025, pages 4594–4627, 2025

2025

[16] [16]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

T. Xiao, E. Jang, D. Kalashnikov, S. Levine, J. Ibarz, K. Hausman, and A. Herzog. Think- ing while moving: Deep reinforcement learning with concurrent control.arXiv preprint arXiv:2004.06089, 2020

work page arXiv 2004

[18] [18]

Black, M

K. Black, M. Galliker, and S. Levine. Real-time execution of action chunking flow policies. Advances in Neural Information Processing Systems, 38:33383–33407, 2026

2026

[19] [19]

Sendai, M

K. Sendai, M. Alvarez, T. Matsushima, Y . Matsuo, and Y . Iwasawa. Leave no observation behind: Real-time correction for vla action chunks.arXiv preprint arXiv:2509.23224, 2025

work page arXiv 2025

[20] [20]

J. Tang, Y . Sun, Y . Zhao, S. Yang, Y . Lin, Z. Zhang, J. Hou, Y . Lu, Z. Liu, and S. Han. Vlash: Real-time vlas via future-state-aware asynchronous inference.arXiv preprint arXiv:2512.01031, 2025

work page arXiv 2025

[21] [21]

Training-Time Action Conditioning for Efficient Real- Time Chunking,

K. Black, A. Z. Ren, M. Equi, and S. Levine. Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

work page arXiv 2025

[22] [22]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[23] [23]

S. H. Høeg, Y . Du, and O. Egeland. Streaming diffusion policy: Fast policy synthesis with variable noise diffusion models.arXiv preprint arXiv:2406.04806, 2024

work page arXiv 2024

[24] [24]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36: 44776–44791, 2023

2023

[25] [25]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[27] [27]

Y . Liu, S. Zhang, Z. Dong, B. Ye, T. Yuan, X. Yu, L. Yin, C. Lu, J. Shi, L. J.-T. Yu, et al. Faster: Toward efficient autoregressive vision language action modeling via neural action tokenization. arXiv preprint arXiv:2512.04952, 2025

work page arXiv 2025

[28] [28]

H. Zhou, W. Liao, X. Huang, Y . Tang, F. Otto, X. Jia, X. Jiang, S. Hilber, G. Li, Q. Wang, et al. Beast: Efficient tokenization of b-splines encoded action sequences for imitation learning. Advances in Neural Information Processing Systems, 38:172934–172959, 2026

2026

[29] [29]

C. Liu, X. Han, J. Gao, Y . Zhao, H. Chen, and Y . Du. Oat: Ordered action tokenization, 2026. URLhttps://arxiv.org/abs/2602.04215

work page arXiv 2026

[30] [30]

E. W. Dijkstra. Cooperating sequential processes. In F. Genuys, editor,Programming Languages, pages 43–112. Academic Press, New York, 1968. 10

1968

[31] [31]

Matthews, M

M. Matthews, M. Beukman, C. Lu, and J. Foerster. Kinetix: Investigating the training of general agents through open-ended physics-based control tasks. InInternational Conference on Learning Representations, volume 2025, pages 58515–58564, 2025

2025

[32] [32]

Torne, K

M. Torne, K. Pertsch, H. Walke, K. Vedder, S. Nair, B. Ichter, A. Z. Ren, H. Wang, J. Tang, K. Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596, 2026

work page arXiv 2026

[33] [33]

Hokamp and Q

C. Hokamp and Q. Liu. Lexically constrained decoding for sequence generation using grid beam search. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, 2017

2017

[34] [34]

Post and D

M. Post and D. Vilar. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314–1324, 2018

2018

[35] [35]

Deutsch, S

D. Deutsch, S. Upadhyay, and D. Roth. A general-purpose algorithm for constrained sequential inference. InProceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 482–492, 2019

2019

[36] [36]

S. Jang, D. Kim, C. Kim, Y . Kim, and J. Shin. Verifier-free test-time sampling for vision language action models.arXiv preprint arXiv:2510.05681, 2025

work page arXiv 2025

[37] [37]

R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean. Efficiently scaling transformer inference.Proceedings of machine learning and systems, 5:606–624, 2023

2023

[38] [38]

PaliGemma: A versatile 3B VLM for transfer

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Driess, J

D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867–102888, 2026

2026

[40] [40]

Leviathan, M

Y . Leviathan, M. Kalman, and Y . Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

2023

[41] [41]

C. Wang, K. Cho, and J. Gu. Neural machine translation with byte-level subwords. In Proceedings of the AAAI conference on artificial intelligence, pages 9154–9160, 2020

2020

[42] [42]

Sennrich, B

R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. InProceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), pages 1715–1725, 2016

2016

[43] [43]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 11 A Implementation Details A.1 Constrained Decoding under FAST with Byte-level BPE Tokenization Throughout all experiments, we have implemented our approach using JAX based on theopenpi4 repository to adopt π0-FAST [6] as the base model for ourπ0-REA...

work page internal anchor Pith review Pith/arXiv arXiv 2017