DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Gianluca Geraci; Jakub Suliga; Moritz Reuss; Pankhuri Vanjani; Rudolf Lioutikov; Xinkai Jiang; Zhuoyue Li

arxiv: 2606.12105 · v1 · pith:LRE4WLRKnew · submitted 2026-06-10 · 💻 cs.RO · cs.CV · cs.LG

Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov

Pith reviewed 2026-06-27 09:24 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.LG

keywords vision-language-action · asynchronous multimodal · decoupled temporal processing · robot manipulation · contact-rich tasks · latent buffers · gated cross-attention · high-frequency control

The pith

Decoupling each modality's update rate in a vision-language-action model more than doubles success on contact-rich manipulation tasks while enabling 100 Hz control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models normally inherit a single synchronous clock from pretraining and process every input at one fixed rate. This creates a mismatch with physical robots, where proprioception or force can change at hundreds of hertz, vision updates more slowly, and language remains constant. The paper tests the claim that stronger representations result when each modality keeps its own latent buffer, refreshed at its sensor rate and read continuously by the action head. New high-frequency streams are integrated through gated cross-attention that leaves the original pretrained backbone unchanged. On seven real-world contact-rich tasks the resulting model reaches 95.2 percent average success versus 40.95 percent for the strongest synchronous baseline and sustains smooth reactive control at 100 Hz.

Core claim

DAM-VLA maintains independent per-modality latent buffers that are refreshed at each sensor's native rate and read on demand by the action head. New high-frequency modalities are incorporated via gated cross-attention without modifying the pretrained vision-language backbone. This produces stronger multimodal representations for contact-rich control and removes the frequency cap imposed by the slowest modality.
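The buffering idea in the core claim can be sketched as a toy simulation. This is a minimal illustration, not the authors' implementation: the `LatentBuffer` class, the `simulate` function, the 10 Hz vision rate, and the 500 Hz force/torque rate are all assumptions chosen for the example. Only the 100 Hz control rate comes from the paper's abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class LatentBuffer:
    """Latest latent for one modality (hypothetical structure)."""
    period_ms: Optional[int]  # None means "written once", e.g. language
    latent: Optional[List[float]] = None
    writes: int = 0

    def maybe_refresh(self, t_ms: int, encode: Callable[[int], List[float]]) -> None:
        # Each buffer follows its own sensor period, never a shared clock.
        if self.period_ms is None:
            due = self.latent is None
        else:
            due = t_ms % self.period_ms == 0
        if due:
            self.latent = encode(t_ms)
            self.writes += 1


def simulate(duration_ms: int = 1000, control_period_ms: int = 10) -> Dict[str, int]:
    """Run a 1 ms base clock and count buffer writes and action-head reads."""
    buffers = {
        "language": LatentBuffer(period_ms=None),
        "vision": LatentBuffer(period_ms=100),  # assumed 10 Hz camera
        "force": LatentBuffer(period_ms=2),     # assumed 500 Hz force/torque
    }
    reads = 0
    for t_ms in range(duration_ms):
        for buf in buffers.values():
            # Stand-in encoder: the "latent" is just the timestamp.
            buf.maybe_refresh(t_ms, lambda t: [float(t)])
        if t_ms % control_period_ms == 0:
            # The action head reads whatever each buffer holds right now.
            _snapshot = {name: buf.latent for name, buf in buffers.items()}
            reads += 1
    counts = {name: buf.writes for name, buf in buffers.items()}
    counts["action_reads"] = reads
    return counts


print(simulate())
```

Under these assumed rates, one simulated second produces a single language write, 10 vision writes, 500 force writes, and 100 action-head reads, so no stream waits on another.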

What carries the argument

Per-modality latent buffers refreshed at sensor rates and read continuously by the action head, combined with gated cross-attention for adding high-frequency inputs.
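One way a gate can add a new input stream without disturbing a pretrained path is a residual cross-attention branch scaled by a learnable gate that starts at zero. The sketch below assumes a scalar tanh gate in the style of Flamingo. The paper's figure caption mentions both a global gate and an input-dependent gate, and its exact formulation is not given here, so `gated_cross_attention`, `cross_attend`, and `gate_param` are illustrative names. Learned projections are omitted.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-query scaled dot-product attention over buffer tokens."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def gated_cross_attention(hidden: Vector, buffer_tokens: List[Vector],
                          gate_param: float) -> Vector:
    """Residual update: hidden + tanh(gate_param) * attention(hidden, buffer)."""
    attended = cross_attend(hidden, buffer_tokens, buffer_tokens)
    gate = math.tanh(gate_param)
    return [h + gate * a for h, a in zip(hidden, attended)]


hidden = [1.0, 0.0]
force_tokens = [[1.0, 0.0], [0.0, 1.0]]  # stand-in latents from a force buffer
print(gated_cross_attention(hidden, force_tokens, gate_param=0.0))
print(gated_cross_attention(hidden, force_tokens, gate_param=1.0))
```

With `gate_param=0.0` the output equals the input hidden state, so the new pathway starts as a no-op and only contributes as the gate is trained away from zero.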

If this is right

Average success rate on the seven tasks rises from 40.95 percent to 95.2 percent.
Control remains smooth and reactive at 100 Hz without undersampling fast modalities.
The pretrained vision-language backbone stays intact while high-frequency modalities are added.
Oversampling of slow modalities and undersampling of fast ones are both avoided.
Action generation is no longer limited by the lowest effective input frequency.
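The oversampling and undersampling that the list refers to can be made concrete with a little arithmetic. The sketch below is illustrative only: `synchronous_mismatch` is a hypothetical helper, and the 10 Hz, 500 Hz, and 30 Hz figures are assumed values, not rates reported in the paper.

```python
from typing import Dict


def synchronous_mismatch(sensor_hz: Dict[str, float], clock_hz: float) -> Dict[str, str]:
    """Describe how a single shared clock treats each sensor stream."""
    report = {}
    for name, rate in sensor_hz.items():
        if rate > clock_hz:
            # Fast sensor: only clock_hz of its samples per second are consumed.
            unused = 1.0 - clock_hz / rate
            report[name] = f"undersampled: {unused:.0%} of samples unused"
        elif rate < clock_hz:
            # Slow sensor: the same sample is re-processed on several ticks.
            report[name] = f"oversampled: each sample re-processed {clock_hz / rate:.1f}x"
        else:
            report[name] = "matched"
    return report


rates = {"vision": 10.0, "force": 500.0}  # hypothetical sensor rates
for name, verdict in synchronous_mismatch(rates, clock_hz=30.0).items():
    print(name, "->", verdict)
```

With these assumed numbers, a 30 Hz shared clock re-processes each camera frame three times and leaves most force samples unread.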

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same buffering approach could be tested on tasks with longer time horizons to check whether independent modality memories improve retention of distant events.
Energy or compute savings may arise from avoiding repeated processing of slow-changing inputs, though this is not measured in the paper.
The architecture could be applied to other multimodal robot policies where input rates differ by orders of magnitude.

Load-bearing premise

That independent per-modality buffers updated at native sensor rates will produce stronger representations and more robust control without degrading the pretrained backbone.

What would settle it

Running DAM-VLA and the strongest synchronous baseline on the same seven tasks and finding that the decoupled model does not exceed 40.95 percent average success or cannot sustain 100 Hz reactive control.

Figures

Figures reproduced from arXiv: 2606.12105 by Gianluca Geraci, Jakub Suliga, Moritz Reuss, Pankhuri Vanjani, Rudolf Lioutikov, Xinkai Jiang, Zhuoyue Li.

**Figure 2.** DAM-VLA architecture. Each modality stream encodes tokens into independent latent buffers at their sensor rate: vision periodically, proprioception and force/torque at high frequency. The action expert reads all buffers continuously via parallel GCA pathways, a global-gate pathway for visual memory and an input-dependent gate pathway for force/torque, adding new modalities through dedicated cross-attention…

**Figure 3.** Qualitative rollout comparison across a subset of manipulation tasks. Green boxes indicate…

**Figure 4.** Success rates across different tasks and model configurations. Blue indicates clean success,…

**Figure 5.** Handwash execution wrench: DAM-VLA makes a single clean press, whereas X-VLAAFM repeatedly presses and retracts (multipress) over a longer episode.

…tion additional modalities can build on, each contributing independently. Memory alone provides meaningful gains over the decoupled baseline. DAM-VLA/F achieves 100% on scarf and sweep and 86.7% on button press, but falls short on contact-critical tasks: partia…

**Figure 6.** Command smoothness on Sweep using 7D joint commands. Only 100 Hz configurations…

**Figure 7.** Mean tracking lag comparison on Sweep. Tracking lag measures the estimated temporal…

**Figure 8.** Execution times across different tasks and model configurations. Hatched bars indicate…

Original abstract

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: https://intuitive-robots.github.io/DAM-VLA/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

DAM-VLA shows large real-robot success gains from per-modality asynchronous buffers but bundles multiple changes so the decoupling itself is not isolated.

The main takeaway is that DAM-VLA more than doubles average success on seven contact-rich real-robot tasks (95.2% vs 40.95%) while running smooth 100 Hz control, using per-modality latent buffers refreshed at native sensor rates plus gated cross-attention that leaves the pretrained backbone untouched.

The architecture is new in its explicit decoupling: each modality keeps its own buffer updated at its own rate instead of forcing everything onto a single synchronous clock. The action head reads continuously from these buffers, and high-frequency signals enter through gated cross-attention. This directly targets the mismatch between vision-language pretraining rates and physical sensor rates.

The paper does a solid job showing that the approach works on hardware. Real-world manipulation results carry more weight than simulation numbers, and preserving the backbone is a practical choice that avoids retraining costs.

The soft spot is the missing isolation of the decoupling effect. The design adds independent buffers, gated cross-attention, and a continuously reading action head together. No ablations appear to hold the other pieces fixed while varying only the temporal decoupling, so the jump in performance could come from added capacity or the new integration path rather than asynchrony alone. The abstract also omits task definitions, baseline implementation details, and statistical tests, which leaves the numbers harder to assess.

This work is for researchers building or deploying VLAs on physical robots, especially where reactivity and contact-rich behavior matter. Readers focused on multimodal timing or hardware constraints will find the empirical demonstration useful.

It deserves peer review. The core idea addresses a real deployment limitation and the robot results are concrete enough to warrant referee time, even if the causal attribution needs tighter evidence in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DAM-VLA, a decoupled asynchronous multimodal vision-language-action model. It argues that standard VLAs inherit a shared synchronous clock misaligned with physical sensors (high-frequency modalities at hundreds of Hz, slower vision, constant language), and introduces per-modality latent buffers refreshed at sensor rates, integrated via gated cross-attention that leaves the pretrained backbone intact, with a continuously-reading action head. On seven contact-rich real-world manipulation tasks, it reports more than doubling average success rate versus the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining 100 Hz reactive control.

Significance. If the empirical gains hold and can be attributed specifically to temporal decoupling, the approach could meaningfully advance real-time VLA deployment in robotics by aligning processing with modality dynamics while preserving pretrained models. Real-robot results on contact-rich tasks are a positive aspect of the evaluation.

major comments (2)

[Abstract] Abstract: The central claim attributes the jump from 40.95% to 95.2% success rate to the hypothesis that 'decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control.' However, the architecture simultaneously adds gated cross-attention, per-modality buffers, and a new action head. No ablation holding these constant under synchronous timing is described, so the results do not establish that asynchrony itself (rather than added capacity or the integration mechanism) drives the reported gains.
[Experiments] Experiments section (results on the seven tasks): The synchronous baseline is referred to only as 'the strongest synchronous baseline' without explicit confirmation that it incorporates the same gated cross-attention and buffer mechanisms under a single clock. This comparison is load-bearing for the claim that the decoupled design produces the observed improvement.
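The control that the two major comments ask for can be laid out as a small grid. This is a sketch under stated assumptions: it collapses gated cross-attention and the per-modality buffers into a single "mechanisms" factor, and the role labels (including the guess that asynchrony without buffers is not realizable) are editorial, not taken from the paper. `ablation_grid` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, List, Union

Cell = Dict[str, Union[bool, str]]


def ablation_grid() -> List[Cell]:
    """Enumerate mechanisms (on/off) x timing (sync/async) and label each cell."""
    cells: List[Cell] = []
    for mechanisms, timing in product((False, True), ("synchronous", "asynchronous")):
        if not mechanisms and timing == "synchronous":
            role = "reported synchronous baseline"
        elif mechanisms and timing == "asynchronous":
            role = "full DAM-VLA"
        elif mechanisms and timing == "synchronous":
            role = "missing control: same mechanisms, single clock"
        else:
            # Assumption: per-modality rates need somewhere to hold state.
            role = "likely not realizable: asynchrony without buffers"
        cells.append({"mechanisms": mechanisms, "timing": timing, "role": role})
    return cells


for cell in ablation_grid():
    print(cell)
```

Only the cell labelled as the missing control separates the effect of asynchrony from the effect of the added components, which is the comparison the referee reports as absent.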

minor comments (1)

The manuscript should include explicit task definitions, baseline implementation details, number of trials per task, and any statistical tests or exclusion criteria to support the reported success rates.
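To illustrate why trial counts matter for reading a success rate, the sketch below computes a Wilson score interval for a binomial proportion. The trial counts in the usage lines are hypothetical, since the per-task trial numbers are not available here, and `wilson_interval` is an illustrative helper rather than anything from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 is roughly 95% coverage)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: the same observed rate with 15 and with 150 trials.
print(wilson_interval(14, 15))
print(wilson_interval(140, 150))
```

The same observed rate yields a much wider interval at 15 trials than at 150, which is why the reported percentages are hard to weigh without the number of trials per task.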

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address the two major comments point by point below, clarifying the design rationale while acknowledging where additional discussion or revisions will strengthen the manuscript.

Referee: [Abstract] Abstract: The central claim attributes the jump from 40.95% to 95.2% success rate to the hypothesis that 'decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control.' However, the architecture simultaneously adds gated cross-attention, per-modality buffers, and a new action head. No ablation holding these constant under synchronous timing is described, so the results do not establish that asynchrony itself (rather than added capacity or the integration mechanism) drives the reported gains.

Authors: We agree that a controlled ablation isolating asynchrony from the integration mechanisms would provide stronger evidence. The gated cross-attention, per-modality latent buffers, and continuously reading action head are not independent additions but the concrete realization of the decoupling hypothesis; they enable each modality to update and retain information at its native sensor rate while the backbone remains frozen. The synchronous baseline is the standard pretrained VLA operating under a single shared clock without these asynchronous mechanisms. We will revise the abstract and introduction to temper the causal language, explicitly note that the reported gains reflect the full decoupled architecture, and add a limitations paragraph acknowledging the absence of a same-mechanism synchronous ablation.

revision: partial

Referee: [Experiments] Experiments section (results on the seven tasks): The synchronous baseline is referred to only as 'the strongest synchronous baseline' without explicit confirmation that it incorporates the same gated cross-attention and buffer mechanisms under a single clock. This comparison is load-bearing for the claim that the decoupled design produces the observed improvement.

Authors: The synchronous baseline is a standard synchronous VLA (the strongest publicly reported synchronous model at the time of submission) that processes all modalities at a fixed common rate and does not include per-modality buffers or gated cross-attention, because those components are defined to support asynchronous operation. We will revise the experiments section to provide an explicit architectural description of the baseline, confirm its synchronous clock, and add a table or paragraph contrasting the two designs.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on real-robot tasks

The paper advances an architectural hypothesis (decoupling temporal processing per modality via independent latent buffers refreshed at sensor rates, integrated by gated cross-attention) and reports empirical success-rate gains on seven contact-rich manipulation tasks. No equations, derivations, or predictions appear that reduce the reported outcomes to fitted parameters, self-definitions, or self-citation chains. The central claim is tested through implementation and external evaluation rather than by construction from its own inputs. No load-bearing self-citations, uniqueness theorems, or renamed known results are present.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is limited to the explicit hypothesis; no free parameters, invented entities, or additional axioms are stated.

axioms (1)

domain assumption: Decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control.
This is the central hypothesis stated directly in the abstract.

pith-pipeline@v0.9.1-grok · 5761 in / 1222 out tokens · 18606 ms · 2026-06-27T09:24:54.282831+00:00 · methodology
