arxiv: 2606.03784 · v2 · pith:UEPZNDS4new · submitted 2026-06-02 · 💻 cs.RO

Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

Pith reviewed 2026-06-28 09:37 UTC · model grok-4.3

classification 💻 cs.RO
keywords: embodied chain-of-thought, vision-language-action models, robot manipulation, reasoning dropout, generalization, action grounding, manipulation benchmarks

The pith

Embodied chain-of-thought improves vision-language-action models when used only as training supervision rather than as autoregressive prefixes at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to integrate embodied chain-of-thought into vision-language-action models for robot manipulation tasks. It builds a large corpus of trajectories that include explicit reasoning traces and tests multiple ways of incorporating them. The central result is that high-level reasoning alone adds little value, but grounding it in concrete action descriptions helps when the traces shape the model's internal representations during training. Using a reasoning-dropout approach lets the model absorb those traces without generating them at test time, which avoids compounding errors from autoregressive decoding. This produces higher success rates on manipulation benchmarks and better out-of-distribution performance than baselines that rely on test-time chain-of-thought.

Core claim

Effective embodied chain-of-thought grounds high-level semantic understanding into concrete action guidance such as end-effector movement descriptions and image-space trajectories. Explicit chain-of-thought used as an autoregressive action prefix at inference suffers from compounding errors and unstable reasoning-action coupling. Training with a reasoning-dropout strategy instead allows the model to absorb rich reasoning traces during training while predicting actions directly without chain-of-thought decoding at inference, which improves scalability with more pre-training data and yields stronger generalization.

What carries the argument

The reasoning-dropout strategy, which supplies embodied chain-of-thought traces as supervision during training but disables their generation at inference time.
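
As a rough illustration of the idea, the sketch below randomly removes the chain-of-thought trace from a training example so that action prediction is always supervised, with or without reasoning targets, while the inference path never asks for a trace. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Sample` fields, the `p_drop` parameter, and both helper functions are hypothetical names, and the page does not say whether the dropout is applied per sample, per token, or to the loss.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    observation: str      # stand-in for image and instruction tokens (hypothetical)
    cot_trace: List[str]  # embodied CoT steps such as plans or motion descriptions
    action: List[float]   # continuous action target


def build_training_example(sample: Sample, p_drop: float, rng: random.Random) -> Dict[str, object]:
    """Assumed form of reasoning dropout: drop the CoT targets with probability p_drop."""
    drop_cot = rng.random() < p_drop
    return {
        "inputs": sample.observation,
        # CoT is supervised only when it survives the dropout draw.
        "cot_targets": [] if drop_cot else list(sample.cot_trace),
        # The action target is supervised in every case.
        "action_targets": list(sample.action),
    }


def build_inference_example(observation: str) -> Dict[str, object]:
    """At test time no CoT is requested; the model predicts actions directly."""
    return {"inputs": observation, "cot_targets": [], "action_targets": None}
```

Because the action target is present whether or not the trace is kept, the same model sees both conditions during training, which is what would let it act without a trace at inference.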

If this is right

  • Grounding chain-of-thought to concrete action descriptions outperforms high-level reasoning alone.
  • Autoregressive chain-of-thought prefixes become less reliable as model scale and task horizon increase.
  • The training-only approach scales more stably with larger pre-training datasets.
  • Out-of-distribution tasks that require semantic disambiguation or long-horizon execution show the largest gains.
  • Real-robot performance improves over baselines especially on tasks needing precise action grounding.
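
The compounding-error point in the list above can be made concrete with a toy calculation. The sketch assumes, purely for illustration, that each token of an autoregressive prefix is correct independently with a fixed probability; the function name and the 0.99 figure are hypothetical and do not come from the paper.

```python
def prefix_success_probability(per_token_accuracy: float, prefix_length: int) -> float:
    """Probability that every token of a prefix is correct, assuming independent errors."""
    if not 0.0 <= per_token_accuracy <= 1.0:
        raise ValueError("per_token_accuracy must lie in [0, 1]")
    if prefix_length < 0:
        raise ValueError("prefix_length must be non-negative")
    return per_token_accuracy ** prefix_length


# Longer reasoning prefixes leave fewer fully correct rollouts to condition actions on.
for length in (10, 50, 200):
    print(length, round(prefix_success_probability(0.99, length), 3))
```

Under this simplification the chance of an error-free prefix decays geometrically with its length, which is one way to read why long-horizon tasks are hit hardest.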

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same supervision-without-generation pattern may transfer to other sequential prediction domains where intermediate reasoning helps but direct output is preferred at runtime.
  • Even larger embodied datasets could further amplify the gap between training-time supervision and test-time direct prediction.
  • Hybrid variants could selectively re-enable partial chain-of-thought only on tasks where semantic ambiguity remains high after training.

Load-bearing premise

That the model can internalize useful representations from explicit chain-of-thought traces during training so that direct action prediction at inference time retains the benefits without the errors of autoregressive prefixes.

What would settle it

Train two otherwise identical models on the same corpus, one with reasoning dropout and one without, then compare their success rates on the LIBERO-Plus and VLABench suites. A measurably lower success rate for the dropout version would undercut the central claim.
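
One way to decide whether a gap between the two models is "measurable" is a two-proportion z-test on episode-level successes. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; the function name is hypothetical and any counts passed to it would have to come from actual rollouts.

```python
import math
from typing import Tuple


def two_proportion_z_test(successes_a: int, trials_a: int,
                          successes_b: int, trials_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two success rates."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if standard_error == 0.0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_a - rate_b) / standard_error
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

A negative `z` with a small p-value, where model A is the reasoning-dropout variant, would count against the claim. The test treats episodes as independent, which is itself an assumption when tasks share scenes or seeds.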

Figures

Figures reproduced from arXiv: 2606.03784 by Huaping Liu, Jun Guo, Nan Sun, Peiyan Li, Pengxiang Ding, Runze Suo, Wentao Zhao, Wenxuan Song, Xinghang Li, Xin Xiao, Yifei Su, Yongkun Yang, Yuan Zhang.

Figure 1: Overview of ERVLA. We build the largest and most comprehensive embodied CoT corpus to date, which serves as the reasoning pre-training foundation for ERVLA. ERVLA uses CoT supervision, auxiliary action queries, knowledge-truncated KV conditioning, and reasoning dropout to internalize reasoning into action-aware VLM representations and predict continuous actions without mandatory test-time CoT. It achieves …
Figure 2: Embodied CoT data format and statistics. We introduce a hierarchical embodied CoT schema that decomposes robot manipulation into task understanding, planning, spatial grounding, and action-level descriptions. Each sample connects language reasoning with multi-view observations and executable cues. Using this schema, we construct a large-scale public embodied CoT dataset covering 2,592.5 hours, 978,743 epis…
Figure 3: Demonstration of ERVLA architecture. ERVLA integrates explicit embodied CoT supervision into the VLM backbone, uses auxiliary action-query regression to align semantic reasoning with action, applies knowledge truncation so that the action model attends only to the semantic-prefix KV cache, and generates continuous actions through a diffusion transformer with flow-matching.
Figure 4: Next-token action decoding after CoT is brittle, while knowledge insulation limits action feedback. Coupling ECoT with choice policy and flow supervision enables reasoning and action generation to co-adapt for robust control. ECoT scales through synergistic reasoning–action representation learning.
Figure 5: VLM-to-VLA transfer and embodied CoT scaling. Left: ECoT better aligns VLM capability with action. Right: ERVLA scales steadily with more CoT data on both LIBERO-Plus and VLABench, whereas AR CoT+Fast and isolated VLM+DiT show weaker or saturated scaling.
Figure 6: Real-world evaluation. Left: representative rollouts across basic, distractor, semantic, and long-horizon settings. Right: average success rate and progress score across four difficulty tiers, where ERVLA shows stronger robustness under semantic and long-horizon real-world generalization.
Figure 7: Embodied CoT dataset construction pipeline. Raw robot trajectories are converted into structured embodied reasoning supervision through trajectory segmentation, episode-level planning, action-oriented motion annotation, geometric gripper projection, future point-trajectory construction, and sparse object grounding. The resulting annotations provide hierarchical, grounded, and action-oriented CoT signals fo…
Figure 8: Example of multi-view spatial CoT annotation. The same objects and future gripper motion are represented under different camera views, enabling view-consistent grounding and action-oriented reasoning. Point trajectory is removed from wrist views. … we normalize it to the same [0, 1000] coordinate system: $\tilde{b} = \left[ \left\lfloor \frac{1000\,x_1}{W} \right\rceil, \left\lfloor \frac{1000\,y_1}{H} \right\rceil, \left\lfloor \frac{1000\,x_2}{W} \right\rceil, \left\lfloor \frac{1000\,y_2}{H} \right\rceil \right]$ (20). The normalized boxes are store…
Figure 9: Example frame-level ECoT annotation from LIBERO-10. The annotation contains multi-view object grounding, episode-level planning, frame-level subtask reasoning, and action-oriented spatial cues. Coordinates are normalized to the [0, 1000] image coordinate system. C.4 VLM-to-VLA Transfer Study: The second set of experiments studies whether stronger VLMs become stronger VLAs when embodied CoT is used as a tran…
Figure 10: Visualization of VLABench evaluation results.
Figure 11: Visualization of LIBERO-PLUS evaluation results.
Figure 12: Visualization of real-world evaluation results.
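
The captions of Figures 8 and 9 mention that box coordinates are normalized to a [0, 1000] image coordinate system. A minimal sketch of such a normalization follows. The function name, the `(x1, y1, x2, y2)` pixel layout, and the round-to-nearest rule are assumptions made for illustration; the page does not confirm the paper's exact rounding convention.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]


def normalize_box(box: Box, width: int, height: int, scale: int = 1000) -> Tuple[int, int, int, int]:
    """Map a pixel-space box (x1, y1, x2, y2) onto an integer [0, scale] grid."""
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")

    def to_grid(value: float, extent: int) -> int:
        # Round half up; Python's built-in round() uses banker's rounding instead.
        return int(math.floor(scale * value / extent + 0.5))

    x1, y1, x2, y2 = box
    return (to_grid(x1, width), to_grid(y1, height),
            to_grid(x2, width), to_grid(y2, height))


print(normalize_box((64, 48, 320, 240), width=640, height=480))  # (100, 100, 500, 500)
```

Dividing by the image size before scaling makes the coordinates independent of camera resolution, which is what allows the same annotation format to be shared across views.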
Original abstract

Embodied chain-of-thought (CoT) aims to bridge linguistic reasoning and robotic control, but its effective form and integration strategy remain underexplored. In this paper, we revisit embodied CoT for vision-language-action (VLA) models at large scale. We construct the largest embodied CoT corpus to date, comprising 978,743 trajectories, 226.3M samples, and 2592.5 hours of robot data. Through extensive experiments, we find that effective embodied CoT should ground high-level semantic understanding into concrete action guidance, such as end-effector movement descriptions and image-space trajectories, while high-level reasoning alone brings only marginal gains. We further show that explicit CoT does not scale reliably when used as an autoregressive action prefix, as it suffers from compounding inference errors and unstable reasoning-action coupling. To address these limitations, we propose ERVLA, a VLA model that uses embodied CoT as representation-shaping supervision rather than mandatory test-time reasoning. ERVLA is trained with a reasoning-dropout strategy, enabling the model to absorb rich reasoning traces during training while predicting actions directly without CoT decoding during inference. This design improves scalability with increasing pre-training data and avoids autoregressive instability. ERVLA achieves state-of-the-art performance on LIBERO-Plus with an 86.9% success rate and reaches 53.2% success rate on VLABench, demonstrating strong out-of-distribution generalization. In real-robot experiments, ERVLA further outperforms competitive state-of-the-art baselines, especially on tasks requiring semantic disambiguation and long-horizon execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that embodied chain-of-thought (CoT) is best used as representation-shaping supervision in vision-language-action (VLA) models rather than as an autoregressive prefix at inference time. By constructing a large corpus of 978,743 trajectories with grounded CoT (end-effector movements, image-space trajectories), and training ERVLA with reasoning-dropout, the model absorbs reasoning during training but predicts actions directly at test time. This yields SOTA results: 86.9% success on LIBERO-Plus, 53.2% on VLABench, and superior real-robot performance on semantic disambiguation and long-horizon tasks.

Significance. If the empirical claims hold, the work provides a practical method to incorporate rich embodied reasoning into scalable VLA training without the instability of test-time autoregressive CoT, which could advance generalizable robot manipulation policies. The scale of the dataset (226.3M samples, 2592.5 hours) is a notable contribution.

major comments (3)
  1. [Abstract] Abstract: The central performance claims (86.9% on LIBERO-Plus, 53.2% on VLABench) are presented without error bars, detailed baseline comparisons, ablation tables, or dataset construction protocol, making it impossible to assess whether the gains stem from the CoT supervision mechanism or from the scale of the 978k-trajectory corpus.
  2. [Abstract] Abstract (method description): The claim that 'explicit CoT does not scale reliably when used as an autoregressive action prefix' due to compounding errors is load-bearing for motivating ERVLA, but no quantitative comparison of inference errors or controlled ablation isolating CoT traces versus data volume is referenced.
  3. [Abstract] Abstract (experiments summary): No representation-level analysis (e.g., probing classifiers on encoders or embedding similarity metrics between CoT-trained and baseline models) is mentioned to directly support that the training internalizes useful grounded representations, leaving open the possibility that performance differences arise from confounding factors in the corpus.
minor comments (1)
  1. [Abstract] Abstract: The description of the corpus size ('978,743 trajectories, 226.3M samples, and 2592.5 hours') could benefit from clarification on how samples and hours are counted relative to trajectories.
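
The relationship the minor comment asks about can be estimated from the three reported totals. The arithmetic below assumes that samples are counted per frame and that the hours are the total recorded duration; neither assumption is confirmed by the abstract, so the derived rates are only a plausibility check.

```python
TRAJECTORIES = 978_743   # reported trajectory count
SAMPLES = 226.3e6        # reported sample count
HOURS = 2592.5           # reported hours of robot data

total_seconds = HOURS * 3600.0
samples_per_trajectory = SAMPLES / TRAJECTORIES
seconds_per_trajectory = total_seconds / TRAJECTORIES
samples_per_second = SAMPLES / total_seconds

print(f"samples per trajectory: {samples_per_trajectory:.1f}")
print(f"seconds per trajectory: {seconds_per_trajectory:.2f}")
print(f"samples per second:     {samples_per_second:.2f}")
```

Under those assumptions the totals imply roughly 231 samples and about 9.5 seconds per trajectory, or around 24 samples per second, which would be consistent with frame-level sampling.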

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, clarifying the supporting evidence in the full manuscript and noting revisions to improve the abstract's clarity.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claims (86.9% on LIBERO-Plus, 53.2% on VLABench) are presented without error bars, detailed baseline comparisons, ablation tables, or dataset construction protocol, making it impossible to assess whether the gains stem from the CoT supervision mechanism or from the scale of the 978k-trajectory corpus.

    Authors: The abstract is a concise summary. The full manuscript reports error bars (standard deviations over multiple runs) in the experimental tables, provides detailed baseline comparisons in Table 1, ablation tables in Table 3, and the dataset construction protocol in Section 3.1. Ablations control for data volume while varying CoT components to isolate the supervision effect. We will revise the abstract to note that results include error bars and are supported by these controlled analyses. revision: yes

  2. Referee: [Abstract] Abstract (method description): The claim that 'explicit CoT does not scale reliably when used as an autoregressive action prefix' due to compounding errors is load-bearing for motivating ERVLA, but no quantitative comparison of inference errors or controlled ablation isolating CoT traces versus data volume is referenced.

    Authors: The manuscript provides quantitative comparisons of inference errors for autoregressive CoT in Section 4.3, documenting compounding errors and instability. Controlled ablations in Section 4.4 isolate CoT traces while fixing data volume. We will revise the abstract to reference these quantitative results more explicitly. revision: yes

  3. Referee: [Abstract] Abstract (experiments summary): No representation-level analysis (e.g., probing classifiers on encoders or embedding similarity metrics between CoT-trained and baseline models) is mentioned to directly support that the training internalizes useful grounded representations, leaving open the possibility that performance differences arise from confounding factors in the corpus.

    Authors: The full manuscript includes representation-level analyses in Section 4.5, with probing classifiers and embedding similarity metrics demonstrating that ERVLA internalizes more grounded representations than baselines. These results help attribute gains to the supervision mechanism rather than corpus confounders. We will revise the abstract to mention these analyses. revision: yes

Circularity Check

0 steps flagged

No circularity: results are external benchmark success rates

full rationale

The paper's central claims rest on measured success rates on held-out benchmarks (LIBERO-Plus at 86.9%, VLABench at 53.2%) and real-robot experiments. These quantities are independent of any internal fitted parameters or self-referential definitions. No equations, uniqueness theorems, or derivations are presented that reduce to the inputs by construction. The training procedure (CoT supervision + reasoning-dropout) is described as a method, but its effect is evaluated empirically on external data rather than asserted tautologically. This matches the default case of a non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the standard supervised-learning assumptions implicit in VLA training; ERVLA itself is a modeling choice rather than a new physical entity.

axioms (1)
  • domain assumption: Standard supervised-learning assumptions on trajectory data suffice to transfer reasoning benefits to direct action prediction.
    Implicit in the claim that reasoning-dropout training internalizes CoT benefits.

pith-pipeline@v0.9.1-grok · 5865 in / 1176 out tokens · 21847 ms · 2026-06-28T09:37:58.297791+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation

    cs.RO 2026-06 unverdicted novelty 6.0

    E-TTS introduces a plug-and-play test-time scaling method for embodied tasks that unifies reasoning-action sampling with history buffers and closed-loop refinement to improve performance on manipulation benchmarks.

Reference graph

Works this paper leans on

67 extracted references · 36 linked inside Pith · cited by 1 Pith paper

  1. [1]

    Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems, 2025

    AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian...

  2. [2]

    Qwen3-vl technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [3]

    Paligemma: A versatile 3b vlm for transfer, 2024

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bau...

  4. [4]

    Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  5. [5]

    Galliker, and Sergey Levine

    Kevin Black, Manuel Y . Galliker, and Sergey Levine. Real-time execution of action chunking flow policies, 2025. URLhttps://arxiv.org/abs/2506.07339

  6. [6]

    π0: A vision-language-action flow model for general robot control, 2026

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  7. [7]

    Univla: Learning to act anywhere with task-centric latent actions, 2025

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions, 2025. URLhttps://arxiv.org/abs/2505.06111

  8. [8]

    Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution.arXiv preprint arXiv:2602.12684, 2026

    Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, et al. Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution.arXiv preprint arXiv:2602.12684, 2026

  9. [9]

    Worldvla: Towards autoregressive action world model, 2025

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Worldvla: Towards autoregressive action world model, 2025. URLhttps://arxiv.org/abs/2506.21539

  10. [10]

    Training strategies for efficient embodied reasoning, 2025

    William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning, 2025. URL https: //arxiv.org/abs/2505.08243

  11. [11]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  12. [12]

    Starvla: A lego-like codebase for vision-language-action model develop- ing, 2026

    StarVLA Community. Starvla: A lego-like codebase for vision-language-action model develop- ing, 2026. URLhttps://arxiv.org/abs/2604.05014

  13. [13]

    Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine

    Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better, 2025. URL https://arxiv.org/abs/2505.23705

  14. [14]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

  15. [15]

    Libero-plus: In-depth robustness analysis of vision-language-action models, 2025

    Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models, 2025. URL https://arxiv.org/abs/ 2510.13626

  16. [16]

    Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models, 2025

    Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng. Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models, 2025. URLhttps://arxiv.org/abs/2501.18954

  17. [17]

    Fractal: An ultra-large-scale aerial lidar dataset for 3d semantic segmentation of diverse landscapes, 2024

    Charles Gaydon, Michel Daab, and Floryne Roche. Fractal: An ultra-large-scale aerial lidar dataset for 3d semantic segmentation of diverse landscapes, 2024. URL https://arxiv.org/ abs/2405.04634

  18. [18]

    Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025. URL https://arxiv.org/abs/2507.16815. 11

  19. [19]

    Fast-thinkact: Efficient vision-language-action reasoning via verbalizable latent planning, 2026

    Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, and Fu-En Yang. Fast-thinkact: Efficient vision-language-action reasoning via verbalizable latent planning, 2026. URLhttps://arxiv.org/abs/2601.09708

  20. [20]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

  21. [21]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

  22. [22]

    Robot-r1: Reinforcement learning for enhanced embodied reasoning in robotics.arXiv preprint arXiv:2506.00070, 2025

    Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, and Younggyo Seo. Robot-r1: Reinforcement learning for enhanced embodied reasoning in robotics.arXiv preprint arXiv:2506.00070, 2025

  23. [23]

    Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  24. [24]

    Fine-tuning vision-language-action models: Optimizing speed and success, 2025

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URLhttps://arxiv.org/abs/2502.19645

  25. [25]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

  26. [26]

    Molmoact: Action reasoning models that can reason in space, 2025

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URL https://arxiv.org/abs/ 2508.07917

  27. [27]

    Spatial forcing: Implicit spatial representation alignment for vision- language-action model, 2025

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision- language-action model, 2025. URLhttps://arxiv.org/abs/2510.12276

  28. [28]

    Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models, 2025

    Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models, 2025. URLhttps://arxiv.org/abs/2506.07961. 12

  29. [29]

    Cogact: A foundational vision-language- action model for synergizing cognition and action in robotic manipulation, 2024

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, and Baining Guo. Cogact: A foundational vision-language- action model for synergizing cognition and action in robotic manipulation, 2024. URL https: //ar...

  30. [30]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URLhttps://arxiv.org/abs/2210.02747

  31. [31]

    Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https: //arxiv.org/abs/2306.03310

  32. [32]

    Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  33. [33]

    Faster: Toward efficient autoregressive vision language action modeling via neural action tokenization.arXiv preprint arXiv:2512.04952, 2025

    Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, et al. Faster: Toward efficient autoregressive vision language action modeling via neural action tokenization.arXiv preprint arXiv:2512.04952, 2025

  34. [34]

    Last 0: Latent spatio-temporal chain-of-thought for robotic vision-language- action model, 2026

    Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, Zhengping Che, Jian Tang, Pheng-Ann Heng, and Shanghang Zhang. Last 0: Latent spatio-temporal chain-of-thought for robotic vision-language- action model, 2026. URLhttps://arxiv.org/abs/2601.05248

  35. [35]

    Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z

    NVIDIA, :, Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Liang Feng, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee,...

  36. [36]

    URLhttps://arxiv.org/abs/2503.15558

  37. [37]

    Open x- embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x- embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  38. [38]

    mimic-video: Video-action models for generalizable robot control beyond vlas, 2025

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas, 2025. URLhttps://arxiv.org/abs/2512.15692

  39. [39]

    Scalable diffusion models with transformers, 2023

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  40. [40]

    Fast: Efficient action tokenization for vision-language-action models, 2025

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025. URLhttps://arxiv.org/abs/2501.09747

  41. [41]

    Coordinated humanoid manipulation with choice policies, 2025

    Haozhi Qi, Yen-Jen Wang, Toru Lin, Brent Yi, Yi Ma, Koushil Sreenath, and Jitendra Malik. Coordinated humanoid manipulation with choice policies, 2025. URL https://arxiv.org/ abs/2512.25072

  42. [42]

    Eo-1: Interleaved vision-text-action pretraining for general robot control. arXiv preprint arXiv:2508.21112, 2025

    Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision-text-action pretraining for general robot control. arXiv preprint arXiv:2508.21112, 2025

  43. [43]

    Qwen2.5 technical report, 2025

    Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  44. [44]

    Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025

  45. [45]

    Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning, 2024

    Qi Sun, Pengfei Hong, Tej Deep Pala, Vernon Toh, U-Xuan Tan, Deepanway Ghosal, and Soujanya Poria. Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning, 2024. URL https://arxiv.org/abs/2412.11974

  46. [46]

    Mind to hand: Purposeful robotic control via embodied reasoning, 2025

    Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo, Haotian Yang, Weiqi Jin, and Jianan Wang. Mind to hand: Purposeful robotic control via embodied reasoning, 2025. URL https://arxiv.org/abs/2512.08580

  47. [47]

    Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342, 2025

    Gemini Robotics Team, Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Ashwin Balakrishna, Nathan Batchelor, Alex Bewley, Jeff Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342, 2025

  48. [48]

    Qwen3.5: Accelerating productivity with native multimodal agents, February

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February

  49. [49]

    URL https://qwen.ai/blog?id=qwen3.5

  50. [50]

    Bridgedata v2: A dataset for robot learning at scale, 2024

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale, 2024. URL https://arxiv.org/abs/2308.12952

  51. [51]

    Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers

    Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11089–11099, 2025

  52. [52]

    Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025

    Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025

  53. [53]

    Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. arXiv preprint arXiv:2510.16281, 2025

    Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, Andrea Bajcsy, and Claudia Pérez-D’Arpino. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. arXiv preprint arXiv:2510.16281, 2025

  54. [54]

    Florence-2: Advancing a unified representation for a variety of vision tasks, 2023

    Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks, 2023. URL https://arxiv.org/abs/2311.06242

  55. [55]

    World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

  56. [56]

    Embodied-r1: Reinforced embodied reasoning for general robotic manipulation

    Yifu Yuan, Haiqin Cui, Yaoting Huang, Yibin Chen, Fei Ni, Zibin Dong, Pengyi Li, Yan Zheng, and Jianye Hao. Embodied-r1: Reinforced embodied reasoning for general robotic manipulation. arXiv preprint arXiv:2508.13998, 2025

  57. [57]

    Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

  58. [58]

    Lap: Language-action pre-training enables zero-shot cross-embodiment transfer, 2026

    Lihan Zha, Asher J. Hancock, Mingtong Zhang, Tenny Yin, Yixuan Huang, Dhruv Shah, Allen Z. Ren, and Anirudha Majumdar. Lap: Language-action pre-training enables zero-shot cross-embodiment transfer, 2026. URL https://arxiv.org/abs/2602.10556

  59. [59]

    Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

    Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

  60. [60]

    Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks, 2024

    Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks, 2024. URL https://arxiv.org/abs/2412.18194

  61. [61]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2025

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2025. URL https://arxiv.org/abs/2503.22020

  62. [62]

    Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  63. [63]

    X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. URL https://arxiv.org/abs/2510.10274

  64. [64]

    Pokevla: Empowering pocket-sized vision-language-action model with comprehensive world knowledge guidance, 2026

    Yupeng Zheng, Xiang Li, Songen Gu, Yuhang Zheng, Shuai Tian, Weize Li, Linbo Wang, Senyu Fei, Pengfei Li, Yinfeng Gao, Zebin Xing, Yilun Chen, Qichao Zhang, Haoran Li, and Wenchao Ding. Pokevla: Empowering pocket-sized vision-language-action model with comprehensive world knowledge guidance, 2026. URL https://arxiv.org/abs/2604.20834

  65. [65]

    Acot-vla: Action chain-of-thought for vision-language-action models, 2026

    Linqing Zhong, Yi Liu, Yifei Wei, Ziyu Xiong, Maoqing Yao, Si Liu, and Guanghui Ren. Acot-vla: Action chain-of-thought for vision-language-action models, 2026. URL https://arxiv.org/abs/2601.11404

  66. [66]

    Beast: Efficient tokenization of b-splines encoded action sequences for imitation learning. arXiv preprint arXiv:2506.06072, 2025

    Hongyi Zhou, Weiran Liao, Xi Huang, Yucheng Tang, Fabian Otto, Xiaogang Jia, Xinkai Jiang, Simon Hilber, Ge Li, Qian Wang, et al. Beast: Efficient tokenization of b-splines encoded action sequences for imitation learning. arXiv preprint arXiv:2506.06072, 2025

  67. [67]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

A Related Work

A.1 Embodied Reasoning in Robot Manipulation

Vision-language-action ...