Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

Huaping Liu; Jun Guo; Nan Sun; Peiyan Li; Pengxiang Ding; Runze Suo; Wentao Zhao; Wenxuan Song; Xinghang Li; Xin Xiao

arxiv: 2606.03784 · v2 · pith:UEPZNDS4new · submitted 2026-06-02 · 💻 cs.RO


Nan Sun, Yuan Zhang, Yongkun Yang, Wentao Zhao, Peiyan Li, Jun Guo, Wenxuan Song, Pengxiang Ding, Runze Suo, Yifei Su, Xin Xiao, Xinghang Li, Huaping Liu



keywords: embodied, ervla, reasoning, action, autoregressive, chain-of-thought, data, during


Embodied chain-of-thought (CoT) aims to bridge linguistic reasoning and robotic control, but its effective form and integration strategy remain underexplored. In this paper, we revisit embodied CoT for vision-language-action (VLA) models at large scale. We construct the largest embodied CoT corpus to date, comprising 978,743 trajectories, 226.3M samples, and 2592.5 hours of robot data. Through extensive experiments, we find that effective embodied CoT should ground high-level semantic understanding into concrete action guidance, such as end-effector movement descriptions and image-space trajectories, while high-level reasoning alone brings only marginal gains. We further show that explicit CoT does not scale reliably when used as an autoregressive action prefix, as it suffers from compounding inference errors and unstable reasoning-action coupling. To address these limitations, we propose ERVLA, a VLA model that uses embodied CoT as representation-shaping supervision rather than mandatory test-time reasoning. ERVLA is trained with a reasoning-dropout strategy, enabling the model to absorb rich reasoning traces during training while predicting actions directly without CoT decoding during inference. This design improves scalability with increasing pre-training data and avoids autoregressive instability. ERVLA achieves state-of-the-art performance on LIBERO-Plus with an 86.9% success rate and reaches 53.2% success rate on VLABench, demonstrating strong out-of-distribution generalization. In real-robot experiments, ERVLA further outperforms competitive state-of-the-art baselines, especially on tasks requiring semantic disambiguation and long-horizon execution.
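The reasoning-dropout idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how dropout is applied, so everything here is an assumption made for illustration: the function name `build_training_sequence`, the token-list representation, and the `dropout_prob` parameter are hypothetical and are not taken from ERVLA. The sketch only shows the general pattern of randomly omitting the CoT segment during training so that a model also sees instruction-to-action examples, which is the form used at inference when CoT decoding is skipped.

```python
import random
from typing import List, Optional


def build_training_sequence(
    instruction: List[str],
    reasoning: List[str],
    action: List[str],
    dropout_prob: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Assemble one training sequence, sometimes omitting the reasoning.

    Hypothetical helper: with probability `dropout_prob` the reasoning
    (CoT) tokens are dropped, so the target is action tokens directly
    after the instruction. Otherwise the reasoning tokens are kept
    between the instruction and the action.
    """
    if not 0.0 <= dropout_prob <= 1.0:
        raise ValueError("dropout_prob must be in [0, 1]")
    source = rng if rng is not None else random
    # random() returns a value in [0.0, 1.0), so 1.0 always drops
    # the reasoning and 0.0 never does.
    if source.random() < dropout_prob:
        return instruction + action
    return instruction + reasoning + action


def build_inference_prompt(instruction: List[str]) -> List[str]:
    """At inference, no reasoning is decoded: the prompt is the instruction only."""
    return list(instruction)
```

With `dropout_prob=1.0` every training sequence matches the inference-time format, and with `dropout_prob=0.0` the reasoning is always present; intermediate values mix the two.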



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation
cs.RO · 2026-06 · unverdicted · novelty 6.0

E-TTS introduces a plug-and-play test-time scaling method for embodied tasks that unifies reasoning-action sampling with history buffers and closed-loop refinement to improve performance on manipulation benchmarks.