pith. machine review for the scientific record.

arxiv: 2603.19199 · v2 · submitted 2026-03-19 · 💻 cs.RO · cs.CV

Recognition: 2 Lean theorem links

FASTER: Rethinking Real-Time Flow VLAs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:59 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords Vision-Language-Action · real-time robotics · flow matching · action chunking · denoising schedule · reaction latency · streaming inference

The pith

A Horizon-Aware Schedule lets flow-based VLAs complete the first action's denoising in one step instead of many while keeping the full trajectory intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reaction latency in action-chunking VLAs arises because standard flow sampling must finish every denoising step before any movement can begin. By replacing the constant schedule with a Horizon-Aware Schedule, FASTER reallocates sampling effort so that near-term actions are denoised first, allowing the immediate reaction to emerge after a single step. This change is paired with a streaming client-server pipeline that further cuts end-to-end latency on real robots. Experiments on a dynamic table-tennis task confirm that the faster reaction does not visibly degrade long-horizon smoothness or accuracy. The result is a practical reduction in effective reaction time, especially when the policy runs on consumer-grade GPUs.

Core claim

FASTER replaces the fixed denoising schedule of flow-based VLAs with a Horizon-Aware Schedule that adaptively prioritizes near-term actions. The immediate reaction is thereby compressed from roughly ten sampling steps into one step, while the remaining steps continue to refine the longer-horizon portion of the trajectory. Real-robot trials, including a high-speed table-tennis task, demonstrate that the resulting trajectories remain accurate and smooth under streaming execution.

What carries the argument

Horizon-Aware Schedule: an adaptive reordering of flow-matching denoising steps that allocates the first sampling iteration almost entirely to the nearest action chunk.
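
The mechanism can be sketched in a few lines. This is an editorial illustration, not the authors' code: the exact hit-time interpolation and the shaping factor `alpha` are assumptions (the paper ablates a similar factor α in its appendix figures).

```python
import numpy as np

def constant_schedule(horizon: int, n_steps: int) -> np.ndarray:
    """Conventional flow sampling: every action in the chunk becomes
    usable only after the final denoising step."""
    return np.full(horizon, n_steps)

def horizon_aware_schedule(horizon: int, n_steps: int, alpha: float = 1.0) -> np.ndarray:
    """Illustrative horizon-aware 'hit times': the first action is fully
    denoised after one step, later actions after progressively more steps.
    The interpolation and `alpha` are assumptions, not the paper's formula."""
    frac = np.arange(horizon) / max(horizon - 1, 1)   # position in chunk, 0..1
    hits = 1 + (n_steps - 1) * frac ** alpha          # hit time in 1..n_steps
    return np.ceil(hits).astype(int)

H, K = 50, 10  # prediction horizon and sampling steps, as in the paper's examples
print("constant: first action ready after step", constant_schedule(H, K)[0])
print("HAS:      first action ready after step", horizon_aware_schedule(H, K)[0])
```

Under the constant schedule the first action is only usable after all K steps; under the horizon-aware variant it is usable after step 1, while later actions still receive the full sampling budget.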

If this is right

  • Reaction latency drops by roughly an order of magnitude because the first action can be executed after one denoising step.
  • The streaming pipeline allows the robot to begin moving while later denoising steps continue in the background.
  • Consumer-grade GPUs become viable for closed-loop control because the per-step compute budget is reduced.
  • Dynamic tasks such as table tennis become feasible for generalist VLAs without task-specific fine-tuning.
  • Trajectory smoothness is retained because the remaining sampling budget is still applied to the full horizon.
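
A back-of-envelope check on the first bullet, using illustrative numbers rather than measurements from the paper:

```python
# Time to First Action (TTFA) under a constant schedule vs. a horizon-aware
# one. The per-step latency is an assumed figure for a consumer GPU.
per_step_ms = 40.0   # assumed latency of one denoising step
n_steps = 10         # sampling steps used by the baseline schedule

ttfa_constant = n_steps * per_step_ms   # all steps must finish before moving
ttfa_has = 1 * per_step_ms              # first action usable after one step

print(f"constant schedule TTFA: {ttfa_constant:.0f} ms")
print(f"HAS TTFA:               {ttfa_has:.0f} ms "
      f"({ttfa_constant / ttfa_has:.0f}x lower)")
# The remaining (n_steps - 1) steps refine the rest of the chunk while the
# robot is already executing, which is what the streaming pipeline exploits.
```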

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same priority logic could be applied to other iterative sampling policies such as diffusion or consistency models to obtain similar latency gains.
  • If the schedule is made state-dependent rather than purely horizon-dependent, the method might adapt to sudden environmental changes even faster.
  • Extending the approach to multi-robot coordination would require synchronizing the horizon-aware priorities across agents.
  • Longer test horizons than those used in the paper could reveal whether quality degradation appears only after several seconds of execution.

Load-bearing premise

Prioritizing near-term denoising steps will not introduce visible artifacts or instability in the longer part of the trajectory when the policy runs under real-world dynamics.

What would settle it

A controlled robot experiment that runs the same policy for many successive long-horizon executions and records a statistically significant increase in tracking error or oscillation after the first action would falsify the claim that long-horizon quality is preserved.
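
A minimal sketch of the statistical comparison such an experiment would run. The data here are synthetic stand-ins, and Welch's t statistic is one assumed choice of decision rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: per-rollout RMS tracking error for the same
# policy under full sampling vs. the one-step-first-action schedule.
# In the real experiment these arrays would come from repeated
# long-horizon rollouts on the robot.
err_full = rng.normal(0.020, 0.004, size=200)   # hypothetical baseline errors
err_has = rng.normal(0.021, 0.004, size=200)    # hypothetical HAS errors

def welch_t(a: np.ndarray, b: np.ndarray) -> float:
    """Welch's t statistic for a two-sample comparison with unequal variances."""
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    return float((a.mean() - b.mean()) / np.sqrt(va + vb))

t = welch_t(err_has, err_full)
print(f"Welch t = {t:.2f}")
# A consistently large |t| across many such runs would indicate a real
# long-horizon quality gap, falsifying the preservation claim.
```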

Figures

Figures reproduced from arXiv: 2603.19199 by Hengshuang Zhao, Jinghua Hou, Junyi Li, Kaixin Ding, Xianzhe Fan, Yuxiang Lu, Zhe Liu, Zhenya Yang.

Figure 1
Figure 1. We propose FASTER to alleviate the reaction latency bottleneck in action chunking flow policies. By compressing the sampling iterations of the immediate reaction into a single step, FASTER (bottom) achieves 10× acceleration compared to the original π0.5 and X-VLA (top). This enables real-time responsiveness in highly dynamic tasks such as playing table tennis. FASTER is a plug-and-play solution for flow-ba… view at source ↗
Figure 2
Figure 2. Temporal pipelines of (a) synchronous and (b) asynchronous inference in a robotic system composed of an action chunking policy server and a robot client. As indicated by the best and worst cases, reaction time depends on both inference latency and the interval between consecutive inference-execution cycles. We also illustrate the decomposition of two adjacent action chunks to clarify the discretized infere… view at source ↗
Figure 3
Figure 3. Visualizations of (a) straightness S(A) of the denoising path during sampling of the action chunk, and (b) differences between the intermediate clean action estimates Ã_t^{τ→0} at each sampling timestep τ and the final output A_t^0. view at source ↗
Figure 4
Figure 4. Illustration of (a) the constant timestep schedule used in conventional flow sampling and (b) the Horizon-Aware Schedule (HAS) used in FASTER, which allocates adaptive hit times across the action chunk and accelerates the sampling of early actions, enabling streaming output. view at source ↗
Figure 5
Figure 5. Comparison of real-world reaction speed on the table tennis task. Left: visualization of rollouts on RTX 4090; the third column corresponds to the contact moment, and the interval between each image in a row is 166.7 ms. Right: quantitative completion scores on two GPUs. view at source ↗
Figure 6
Figure 6. Comparison of real-world performance and task completion duration on two additional tasks. view at source ↗
Figure 7
Figure 7. Temporal pipeline of asynchronous inference at a fine-grained level. Suppose the inference latency Δt_infer is 2.5 times the controller period Δt_ctrl, resulting in an inference delay of d = 2 and a minimal execution horizon of s_min = 3. view at source ↗
Figure 8
Figure 8. Additional visualizations of (a)(c) straightness S(A) and (b)(d) differences between the intermediate clean action estimates and the final output. (a)(b) are computed using a π0.5 model fine-tuned on the Pick Beverage task with prediction horizon H = 50, while (c)(d) are computed using a model with H = 30. The shaded regions in (a)(c) denote the 5%–95% percentile range across 200 samples. view at source ↗
Figure 9
Figure 9. AgileX Cobot Magic robotic platform with Piper arms. Tasks: we evaluate three real-robot tasks, “Table Tennis”, “Pick Beverage”, and “Fold Towel”. The visualization of the Table Tennis task is provided in the main paper, while illustrations of the other two tasks are shown in Figure 10. view at source ↗
Figure 10
Figure 10. Visualization of Pick Beverage and Fold Towel tasks. view at source ↗
Figure 11
Figure 11. Hit times used in the ablation study, with factor α from 0.4 to 1.0. view at source ↗
Figure 12
Figure 12. Comparison of real-world performance and task completion duration on two real-world tasks using X-VLA. Note that the duration is computed only from successful rollouts and is therefore not directly comparable to the results in… view at source ↗
Figure 13
Figure 13. Comparison of performance on the Kinetix benchmark under different inference delays d, averaged across all feasible execution horizons s. view at source ↗
read the original abstract

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $\pi_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
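
One way to make the abstract's uniform-distribution claim concrete is a small Monte Carlo. All numbers here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

dt_ctrl = 1 / 60   # assumed controller period (s)
s = 25             # execution horizon: actions executed per inference cycle
ttfa = 0.40        # assumed Time to First Action (s), e.g. 10 steps x 40 ms

# An environmental change arrives at a uniformly random phase of the
# s * dt_ctrl inference-execution cycle; the policy can only react once
# the current cycle ends and the next chunk's first action is ready.
phase = rng.uniform(0.0, s * dt_ctrl, size=100_000)
reaction = (s * dt_ctrl - phase) + ttfa

lo, hi = ttfa, ttfa + s * dt_ctrl
print(f"reaction time spans [{reaction.min():.3f}, {reaction.max():.3f}] s, "
      f"mean {reaction.mean():.3f} s (uniform on [{lo:.3f}, {hi:.3f}])")
```

Under this toy model, reaction time is uniform on [TTFA, TTFA + s·Δt_ctrl], so shrinking TTFA shifts the whole distribution down; shrinking the execution horizon s narrows it.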

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that reaction time in action-chunking VLAs follows a uniform distribution set by TTFA and execution horizon; that constant flow schedules create an unnecessary bottleneck by requiring full denoising before any motion; and that a Horizon-Aware Schedule in FASTER adaptively prioritizes near-term denoising steps, compressing the immediate action's denoising roughly 10× (e.g., in π₀.₅ and X-VLA) into a single step while leaving long-horizon trajectory quality statistically intact. The method is paired with a streaming client-server pipeline and validated on real-robot tasks including dynamic table tennis.

Significance. If the separability assumption holds, the work supplies a practical route to sub-100 ms reaction latency for generalist flow VLAs on consumer GPUs without retraining, directly addressing a deployment barrier that current asynchronous methods have left unaddressed. The uniform-distribution framing of reaction time is a clean conceptual contribution that could be reused beyond flow models.

major comments (2)
  1. [§4] §4 (Horizon-Aware Schedule): the central claim that single-step near-term prioritization leaves the remaining flow trajectory statistically equivalent to full sampling rests on an unproven separability assumption for the learned vector field; no derivation shows that early truncation of the immediate-action component does not propagate inconsistency into later actions under non-uniform action distributions or real dynamics.
  2. [§5] §5 (Real-robot experiments): the reported 10× TTFA compression and quality preservation are load-bearing for the contribution, yet the manuscript provides insufficient detail on statistical controls, per-trajectory variance, baseline schedule tuning procedures, and whether post-hoc hyper-parameter search was performed; without these the empirical support remains inconclusive.

minor comments (2)
  1. [§4.1] Notation for the Horizon-Aware Schedule weights should be introduced with an explicit equation rather than prose description only.
  2. [§3] The uniform-distribution analysis of reaction time would benefit from a short appendix deriving the exact bounds on TTFA and horizon rather than stating the result.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have prepared revisions to strengthen the manuscript where the points are valid.

read point-by-point responses
  1. Referee: [§4] §4 (Horizon-Aware Schedule): the central claim that single-step near-term prioritization leaves the remaining flow trajectory statistically equivalent to full sampling rests on an unproven separability assumption for the learned vector field; no derivation shows that early truncation of the immediate-action component does not propagate inconsistency into later actions under non-uniform action distributions or real dynamics.

    Authors: We acknowledge that the manuscript does not contain a formal derivation of the separability assumption. The Horizon-Aware Schedule is motivated by the structure of flow matching, where the learned vector field primarily corrects high-frequency noise in early denoising steps that correspond to immediate actions. We provide empirical support through real-robot experiments and ablations showing that long-horizon metrics (success rate, smoothness, collision rate) remain statistically equivalent. In revision we will explicitly label the assumption, add a short derivation sketch based on the flow ODE formulation, and include controlled ablations on synthetic non-uniform action distributions to quantify any propagation effects. revision: yes

  2. Referee: [§5] §5 (Real-robot experiments): the reported 10× TTFA compression and quality preservation are load-bearing for the contribution, yet the manuscript provides insufficient detail on statistical controls, per-trajectory variance, baseline schedule tuning procedures, and whether post-hoc hyper-parameter search was performed; without these the empirical support remains inconclusive.

    Authors: We agree that the experimental section would benefit from greater transparency. The revised manuscript will expand §5 and the appendix to report: number of trials and confidence intervals for each task; per-trajectory variance for TTFA and trajectory error; the exact grid-search procedure used to tune baseline schedules (with the search space listed); and an explicit statement that no post-hoc hyper-parameter search was performed after the initial experimental design. These additions will make the empirical claims more conclusive. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents the Horizon-Aware Schedule as a proposed engineering method to adaptively prioritize near-term denoising steps in flow-based VLAs, with the claimed 10× TTFA compression described as an empirical outcome rather than a mathematical identity. The reaction-time uniform distribution is introduced as an analysis of existing factors (TTFA and execution horizon), not derived from or fitted to the new schedule itself. No equations reduce the latency improvement to a parameter defined by the same data, no self-citations are load-bearing for the core claim, and no ansatz or uniqueness theorem is smuggled in to force the result by construction. The derivation remains self-contained as an applied innovation supported by real-robot experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the horizon-aware schedule is introduced as a new algorithmic choice whose internal hyperparameters are not detailed.

pith-pipeline@v0.9.0 · 5572 in / 1055 out tokens · 25435 ms · 2026-05-15T07:59:15.266339+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

    cs.RO 2026-04 unverdicted novelty 7.0

    Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...

  2. LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception

    cs.CV 2026-04 unverdicted novelty 5.0

    LiteVLA-H delivers 19.74 Hz action tokens and 6 Hz semantic outputs on Jetson Orin via dual-rate scheduling and mixed fine-tuning, outperforming recent VLA baselines in edge action rate while preserving descriptive co...

  3. LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception

    cs.CV 2026-04 unverdicted novelty 4.0

    LiteVLA-H delivers 50 ms reactive action tokens and 150-165 ms semantic outputs on Jetson AGX Orin by separating fast guidance from slower scene understanding in a compact VLA fine-tuned on mixed aerial and generic data.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · cited by 2 Pith papers · 14 internal anchors

  1. [1]

    Qwen Technical Report

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  2. [2]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: GR00T N1: an open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  3. [3]

    In: RSS (2025)

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.:π0: A vision-language-action flow model for general robot control. In: RSS (2025)

  4. [4]

    In: NeurIPS (2025)

    Black, K., Galliker, M.Y., Levine, S.: Real-time execution of action chunking flow policies. In: NeurIPS (2025)

  5. [5]

    arXiv preprint arXiv:2512.05964 (2025)

    Black, K., Ren, A.Z., Equi, M., Levine, S.: Training-time action conditioning for efficient real-time chunking. arXiv preprint arXiv:2512.05964 (2025)

  6. [6]

    In: IROS (2025)

    Bu, Q., Cai, J., Chen, L., Cui, X., Ding, Y., Feng, S., He, X., Huang, X., et al.: Agibot world colosseo: A large-scale manipulation platform for scalable and in- telligent embodied systems. In: IROS (2025)

  7. [7]

    arXiv preprint arXiv:2507.14049 (2025)

    Budzianowski, P., Maa, W., Freed, M., Mo, J., Hsiao, W., Xie, A., Młoduchowski, T., Tipnis, V., Bolte, B.: Edgevla: Efficient vision-language-action models. arXiv preprint arXiv:2507.14049 (2025)

  8. [8]

    In: ICLR (2026)

    Cadene, R., Alibert, S., Capuano, F., Aractingi, M., Zouitine, A., Kooijmans, P., Choghari, J., Russi, M., Pascal, C., Palma, S., Shukor, M., Moss, J., Soare, A., Aubakirova, D., Lhoest, Q., Gallouédec, Q., Wolf, T.: Lerobot: An open-source library for end-to-end robot learning. In: ICLR (2026)

  9. [9]

    arXiv preprint arXiv:2602.12684 (2026)

    Cai, R., Guo, J., He, X., Jin, P., Li, J., Lin, B., Liu, F., Liu, W., Ma, F., Ma, K., et al.: Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution. arXiv preprint arXiv:2602.12684 (2026)

  10. [10]

    Chen, B., Monsó, D.M., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusionforcing:Next-tokenpredictionmeetsfull-sequencediffusion.In:NeurIPS (2024)

  11. [11]

    In: ICML (2025) 16 Y

    Chen, H., Liu, M., Ma, C., Ma, X., Ma, Z., Wu, H., Chen, Y., Zhong, Y., Wang, M., Li, Q., Yang, Y.: Falcon: Fast visuomotor policies via partial denoising. In: ICML (2025) 16 Y. Lu, Z. Liu et al

  12. [12]

    arXiv preprint arXiv:2510.25122 (2025)

    Chen, J., Wang, J., Chen, L., Cai, C., Lu, J.: Nanovla: Routing decoupled vision- language understanding for nano-sized generalist robotic policies. arXiv preprint arXiv:2510.25122 (2025)

  13. [13]

    arXiv preprint arXiv:2506.17639 (2025)

    Chen, Y., Li, X.: Rlrc: Reinforcement learning-based recovery for compressed vision-language-action models. arXiv preprint arXiv:2506.17639 (2025)

  14. [14]

    TMLR (2025)

    Chen, Z., Yuan, X., Mu, T., Su, H.: Responsive noise-relaying diffusion policy: Responsive and efficient visuomotor control. TMLR (2025)

  15. [15]

    In: RSS (2023)

    Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., Song, S.: Diffusion Policy: Visuomotor policy learning via action diffusion. In: RSS (2023)

  16. [16]

    arXiv preprint arXiv:2512.20276 (2025)

    Dai, Y., Gu, H., Wang, T., Cheng, Q., Zheng, Y., Qiu, Z., Gong, L., Lou, W., Zhou, X.: Actionflow: A pipelined action acceleration for vision language models on edge. arXiv preprint arXiv:2512.20276 (2025)

  17. [17]

    IEEE Transactions on robotics (2025)

    Ding, H., Jaquier, N., Peters, J., Rozo, L.: Fast and robust visuomotor riemannian flow matching policy. IEEE Transactions on robotics (2025)

  18. [18]

    In: IROS

    Duan, Y., Yin, H., Kragic, D.: Real-time iteration scheme for diffusion policy. In: IROS. pp. 11758–11764 (2025)

  19. [19]

    Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds

    Fan, X., Deng, S., Wu, X., Lu, Y., Li, Z., Yan, M., Zhang, Y., Zhang, Z., Wang, H., Zhao, H.: Any3d-vla: Enhancing vla robustness via diverse point clouds. arXiv preprint arXiv:2602.00807 (2026)

  20. [20]

    arXiv preprint arXiv:2509.09090 (2025)

    Fang, H., Liu, Y., Du, Y., Du, L., Yang, H.: Sqap-vla: A synergistic quantization- aware pruning framework for high-performance vision-language-action models. arXiv preprint arXiv:2509.09090 (2025)

  21. [21]

    In: ICLR (2025)

    Frans, K., Hafner, D., Levine, S., Abbeel, P.: One step diffusion via shortcut models. In: ICLR (2025)

  22. [22]

    In: CoRL (2024)

    Fu, Z., Zhao, T.Z., Finn, C.: Mobile ALOHA: Learning bimanual mobile manip- ulation using low-cost whole-body teleoperation. In: CoRL (2024)

  23. [23]

    arXiv preprint arXiv:2511.18950 (2025)

    Gao, J., Ye, F., Zhang, J., Qian, W.: Compressor-vla: Instruction-guided visual token compression for efficient robotic manipulation. arXiv preprint arXiv:2511.18950 (2025)

  24. [24]

    In: NeurIPS (2025)

    Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step gener- ative modeling. In: NeurIPS (2025)

  25. [25]

    arXiv preprint arXiv:2510.17111 (2025)

    Guan, W., Hu, Q., Li, A., Cheng, J.: Efficient vision-language-action models for embodied manipulation: A systematic survey. arXiv preprint arXiv:2510.17111 (2025)

  26. [26]

    MiMo-Embodied: X-Embodied Foundation Model Technical Report

    Hao, X., Zhou, L., Huang, Z., Hou, Z., Tang, Y., Zhang, L., Li, G., Lu, Z., Ren, S., Meng, X., et al.: Mimo-embodied: X-embodied foundation model technical report. arXiv preprint arXiv:2511.16518 (2025)

  27. [27]

    In: NeurIPS (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)

  28. [28]

    arXiv preprint arXiv:2406.04806 (2024)

    Høeg, S.H., Du, Y., Egeland, O.: Streaming diffusion policy: Fast policy synthesis with variable noise diffusion models. arXiv preprint arXiv:2406.04806 (2024)

  29. [29]

    Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference.arXiv preprint arXiv:2401.08671, 2024

    Holmes, C., Tanaka, M., Wyatt, M., Awan, A.A., Rasley, J., Rajbhandari, S., Aminabadi, R.Y., Qin, H., Bakhtiari, A., Kurilenko, L., et al.: Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference. arXiv preprint arXiv:2401.08671 (2024)

  30. [30]

    arXiv preprint arXiv:2602.00780 (2026) FASTER: Rethinking Real-Time Flow VLAs 17

    Huang, Y., Ding, L., Tang, Z., Zhu, Z., Deng, J., Lin, X., Liu, S., Ren, H., Ji, J., Zhang, Y.: Environment-aware adaptive pruning with interleaved inference orchestration for vision-language-action models. arXiv preprint arXiv:2602.00780 (2026) FASTER: Rethinking Real-Time Flow VLAs 17

  31. [31]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., et al.:π∗ 0.6: a vla that learns from experience. arXiv preprint arXiv:2511.14759 (2025)

  32. [32]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.:π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

  33. [33]

    arXiv preprint arXiv:2510.08464 (2025)

    Jabbour, J., Kim, D.K., Smith, M., Patrikar, J., Ghosal, R., Wang, Y., Agha, A., Reddi, V.J., Omidshafiei, S.: Don’t run with scissors: Pruning breaks vla models but they can be recovered. arXiv preprint arXiv:2510.08464 (2025)

  34. [34]

    arXiv preprint arXiv:2601.20262 (2026)

    Jeon, B., Choi, Y., Kim, T.: Shallow-π: Knowledge distillation for flow-based vlas. arXiv preprint arXiv:2601.20262 (2026)

  35. [35]

    arXiv preprint arXiv:2412.09265 (2024)

    Jia, B., Ding, P., Cui, C., Sun, M., Qian, P., Huang, S., Fan, Z., Wang, D.: Score and distribution matching policy: Advanced accelerated visuomotor policies via matched distillation. arXiv preprint arXiv:2412.09265 (2024)

  36. [36]

    arXiv preprint arXiv:2509.12594 (2025)

    Jiang, T., Jiang, X., Ma, Y., Wen, X., Li, B., Zhan, K., Jia, P., Liu, Y., Sun, S., Lang, X.: The better you learn, the smarter you prune: Towards efficient vision-language-action models via differentiable token pruning. arXiv preprint arXiv:2509.12594 (2025)

  37. [37]

    In: RSS (2024)

    Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: DROID: A large-scale in-the-wild robot manipulation dataset. In: RSS (2024)

  38. [38]

    In: RSS (2025)

    Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Opti- mizing speed and success. In: RSS (2025)

  39. [39]

    In: CoRL (2024)

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: OpenVLA: An open-source vision-language-action model. In: CoRL (2024)

  40. [40]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

  41. [41]

    arXiv preprint arXiv:2511.10518 (2025)

    Li, W., Zhang, R., Shao, R., Fang, Z., Zhou, K., Tian, Z., Nie, L.: Semanticvla: Semantic-aligned sparsification and enhancement for efficient robotic manipula- tion. arXiv preprint arXiv:2511.10518 (2025)

  42. [42]

    arXiv preprint arXiv:2506.12723 (2025)

    Li, Y., Meng, Y., Sun, Z., Ji, K., Tang, C., Fan, J., Ma, X., Xia, S., Wang, Z., Zhu, W.: Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723 (2025)

  43. [43]

    arXiv preprint arXiv:2508.14042 (2025)

    Li, Z., Wu, X., Xu, Z., Zhao, H.: Train once, deploy anywhere: Realize data- efficient dynamic object manipulation. arXiv preprint arXiv:2508.14042 (2025)

  44. [44]

    arXiv preprint arXiv:2512.07697 (2025)

    Liao, A., Kim, D.K., Smith, M.O., Agha-mohammadi, A.a., Omidshafiei, S.: Delay-aware diffusion policy: Bridging the observation-execution gap in dynamic tasks. arXiv preprint arXiv:2512.07697 (2025)

  45. [45]

    arXiv preprint arXiv:2511.04555 (2025)

    Lin, T., Zhong, Y., Du, Y., Zhang, J., Liu, J., Chen, Y., Gu, E., Liu, Z., Cai, H., Zou, Y., et al.: Evo-1: Lightweight vision-language-action model with preserved semantic alignment. arXiv preprint arXiv:2511.04555 (2025)

  46. [46]

    In: ICLR (2023)

    Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: ICLR (2023)

  47. [47]

    Flow Matching Guide and Code

    Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R.T., Lopez-Paz, D., Ben-Hamu, H., Gat, I.: Flow matching guide and code. arXiv preprint arXiv:2412.06264 (2024)

  48. [48]

    In: NeurIPS

    Liu,B.,Zhu,Y.,Gao,C.,Feng,Y.,Liu,Q.,Zhu,Y.,Stone,P.:Libero:Benchmark- ing knowledge transfer for lifelong robot learning. In: NeurIPS. pp. 44776–44791 (2023) 18 Y. Lu, Z. Liu et al

  49. [49]

    In: NeurIPS

    Liu, J., Liu, M., Wang, Z., An, P., Li, X., Zhou, K., Yang, S., Zhang, R., Guo, Y., Zhang, S.: Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. In: NeurIPS. pp. 40085–40110 (2024)

  50. [50]

    arXiv preprint arXiv:2602.03310 (2026)

    Liu, S., Li, B., Ma, K., Wu, L., Tan, H., Ouyang, X., Su, H., Zhu, J.: RDT2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment gen- eralization. arXiv preprint arXiv:2602.03310 (2026)

  51. [51]

    In: ICLR (2023)

    Liu, X., Gong, C., qiang liu: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: ICLR (2023)

  52. [52]

    In: ICLR (2025)

    Liu, Y., Hamid, J.I., Xie, A., Lee, Y., Du, M., Finn, C.: Bidirectional decoding: Improving action chunking via guided test-time sampling. In: ICLR (2025)

  53. [53]

    arXiv preprint arXiv:2602.12978 (2026)

    Liu, Y., Yu, H., Zhao, J., Li, B., Zhang, D., Li, M., Wu, W., Hu, Y., Xie, J., Guo, J., et al.: Learning native continuation for action chunking flow policies. arXiv preprint arXiv:2602.12978 (2026)

  54. Liu, Z., Huang, R., Yang, R., Yan, S., Wang, Z., Hou, L., Lin, D., Bai, X., Zhao, H.: Drivepi: Spatial-aware 4d mllm for unified autonomous driving understanding, perception, prediction and planning. In: CVPR (2026)

  55. Liu, Z., Chen, Y., Cai, H., Lin, T., Yang, S., Liu, Z., Zhao, B.: Vla-pruner: Temporal-aware dual-level visual token pruning for efficient vision-language-action inference. arXiv preprint arXiv:2511.16449 (2025)

  56. Lu, G., Gao, Z., Chen, T., Dai, W., Wang, Z., Ding, W., Tang, Y.: Manicm: Real-time 3d diffusion policy via consistency model for robotic manipulation. arXiv preprint arXiv:2406.01586 (2024)

  57. Ma, X., Yuan, Z., Zhang, Z., Shi, K., Sun, L., Ye, Y.: Blurr: A boosted low-resource inference for vision-language-action models. arXiv preprint arXiv:2512.11769 (2025)

  58. Ma, Y., Song, Z., Zhuang, Y., Hao, J., King, I.: A survey on vision-language-action models for embodied ai. arXiv preprint arXiv:2405.14093 (2024)

  59. Ma, Y., Zhou, Y., Yang, Y., Wang, T., Fan, H.: Running vlas at real-time speed. arXiv preprint arXiv:2510.26742 (2025)

  60. Matthews, M., Beukman, M., Lu, C., Foerster, J.N.: Kinetix: Investigating the training of general agents through open-ended physics-based control tasks. In: ICLR (2025)

  61. Mees, O., Hermann, L., Rosete-Beas, E., Burgard, W.: CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. RAL (2022)

  62. Ni, C., Chen, C., Wang, X., Zhu, Z., Zheng, W., Wang, B., Chen, T., Zhao, G., Li, H., Dong, Z., et al.: Swiftvla: Unlocking spatiotemporal dynamics for lightweight vla models at minimal overhead. arXiv preprint arXiv:2512.00903 (2025)

  63. Padalkar, A., Pooley, A., Jain, A., Bewley, A., Herzog, A., Irpan, A., Khazatsky, A., Rai, A., Singh, A., Brohan, A., et al.: Open X-Embodiment: Robotic learning datasets and RT-X models. In: ICRA (2024)

  64. Park, S., Kim, H., Jeon, W., Yang, J., Jeon, B., Oh, Y., Choi, J.: Quantization-aware imitation-learning for resource-efficient robotic control. arXiv preprint arXiv:2412.01034 (2024)

  65. Park, S., Kim, H., Kim, S., Jeon, W., Yang, J., Jeon, B., Oh, Y., Choi, J.: Saliency-aware quantized imitation learning for efficient robotic control. In: ICCV. pp. 13140–13150 (2025)

  66. Pei, X., Chen, Y., Xu, S., Wang, Y., Shi, Y., Xu, C.: Action-aware dynamic pruning for efficient vision-language-action manipulation. In: ICLR (2026)

  67. Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

  68. Reuss, M., Zhou, H., Rühle, M., Yağmurlu, Ö.E., Otto, F., Lioutikov, R.: Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies. In: CoRL (2025)

  69. Sapkota, R., Cao, Y., Roumeliotis, K.I., Karkee, M.: Vision-language-action (vla) models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769 (2025)

  70. Shao, R., Li, W., Zhang, L., Zhang, R., Liu, Z., Chen, R., Nie, L.: Large vlm-based vision-language-action models for robotic manipulation: A survey. arXiv preprint arXiv:2508.13073 (2025)

  71. Sheng, J., Wang, Z., Li, P., Liu, M.: Mp1: Meanflow tames policy learning in 1-step for robotic manipulation. In: AAAI (2026)

  72. Shi, M., Chen, L., Chen, J., Lu, Y., Liu, C., Ren, G., Luo, P., Huang, D., Yao, M., Li, H.: Is diversity all you need for scalable robotic manipulation? arXiv preprint arXiv:2507.06219 (2025)

  73. Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al.: Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844 (2025)

  74. Sochopoulos, A., Malkin, N., Tsagkas, N., Moura, J., Gienger, M., Vijayakumar, S.: Fast flow-based visuomotor policies via conditional optimal transport couplings. In: CoRL (2025)

  75. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)

  76. Song, W., Chen, J., Ding, P., Huang, Y., Zhao, H., Wang, D., Li, H.: Ceed-vla: Consistency vision-language-action model with early-exit decoding. arXiv preprint arXiv:2506.13725 (2025)

  77. Song, W., Chen, J., Ding, P., Zhao, H., Zhao, W., Zhong, Z., Ge, Z., Ma, J., Li, H.: Accelerating vision-language-action model integrated with action chunking via parallel decoding. In: IROS (2025)

  78. Sun, M., Wang, W., Li, G., Liu, J., Sun, J., Feng, W., Lao, S., Zhou, S., He, Q., Liu, J.: Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. In: CVPR. pp. 7364–7373 (2025)

  79. Taherin, A., Lin, J., Akbari, A., Akbari, A., Zhao, P., Chen, W., Kaeli, D., Wang, Y.: Cross-platform scaling of vision-language-action models from edge to cloud gpus. arXiv preprint arXiv:2509.11480 (2025)

  80. Tang, J., Sun, Y., Zhao, Y., Yang, S., Lin, Y., Zhang, Z., Hou, J., Lu, Y., Liu, Z., Han, S.: Vlash: Real-time vlas via future-state-aware asynchronous inference. arXiv preprint arXiv:2512.01031 (2025)
