KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

Zihao Zheng , Zhihao Mao , Maoliang Li , Jiayu Chen , Xinhao Sun , Zhaobo Zhang , Donggang Cao , Hong Mei

show 1 more author

Xiang Chen

Authors on Pith no claims yet

classification 💻 cs.RO cs.LG

keywords addresserrorsissueskervmodelsthresholdacceptanceadjustment

0 comments

read the original abstract

Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address the above two issues effectively. Meanwhile, as the bridge between AI and the physical world, existing embodied intelligence has overlooked the application of robotic kinematics. To address these issues, we innovatively combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27%~37% acceleration with nearly no Success Rate loss.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
cs.RO 2026-05 unverdicted novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
cs.RO 2026-04 unverdicted novelty 6.0

FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models
cs.DC 2026-03 unverdicted novelty 5.0

RoboECC delivers up to 3.28x speedup for VLA model inference via co-aware segmentation and network-aware adjustment with 2.55-2.62% overhead.