pith. machine review for the scientific record.

arxiv: 2605.13632 · v1 · submitted 2026-05-13 · 💻 cs.RO · cs.CV

Recognition: unknown

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:27 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords Vision-Language-Action · Embodied reasoning · Interactive guidance · Spatial priors · Chain-of-Thought · Robot control · Out-of-domain robustness

The pith

GTA-VLA lets users steer vision-language-action models with explicit spatial visual cues for better robot control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GTA-VLA, a framework that accepts optional user-provided spatial priors such as points, boxes, and traces to condition a unified spatial-visual Chain-of-Thought. Existing direct Sense-to-Act policies work inside training distributions but fail under visual shifts or when errors occur, and prior embodied reasoning methods lack a direct path for human spatial correction. By aligning external guidance with internal planning before passing to a lightweight action head, the model achieves 81.2 percent success on the SimplerEnv WidowX benchmark and shows clear gains from one interaction on out-of-domain cases. A sympathetic reader cares because this turns brittle robot policies into steerable ones that can recover without retraining.
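To make the interaction concrete, below is a minimal sketch of how optional spatial priors might be attached to a policy call. The names (SpatialPrior, serialize_priors, policy.think, policy.act_head) and the tag-style text encoding are illustrative assumptions; the paper does not publish this exact interface.

    # Hypothetical Guide-Think-Act call path; not the authors' implementation.
    from dataclasses import dataclass
    from typing import Literal, Optional, Sequence

    @dataclass
    class SpatialPrior:
        kind: Literal["point", "box", "trace"]
        coords: Sequence[tuple]  # (x, y) pixel pairs; a box is its two corners

    def serialize_priors(priors: Sequence[SpatialPrior]) -> str:
        """Render user guidance as text the VLM backbone can condition on."""
        parts = []
        for p in priors:
            pts = " ".join(f"({x:.0f},{y:.0f})" for x, y in p.coords)
            parts.append(f"<{p.kind}>{pts}</{p.kind}>")
        return " ".join(parts)

    def guide_think_act(policy, image, instruction: str,
                        priors: Optional[Sequence[SpatialPrior]] = None):
        guidance = serialize_priors(priors) if priors else ""
        reasoning = policy.think(image, instruction, guidance)  # spatial-visual CoT
        return policy.act_head(reasoning)                       # lightweight action head

Because the priors argument is optional, the same call degrades to a standard non-interactive VLA rollout when the user supplies nothing.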

Core claim

The GTA-VLA framework enables spatially steerable embodied reasoning by allowing users to supply affordance points, boxes, and traces that the model directly conditions on when generating a unified spatial-visual Chain-of-Thought, which integrates human visual intent with autonomous task planning and is executed through a coupled lightweight reactive action head.

What carries the argument

The unified spatial-visual Chain-of-Thought that integrates external user spatial priors with internal task planning before action generation.

If this is right

  • Achieves a state-of-the-art 81.2 percent success rate on the in-domain SimplerEnv WidowX benchmark.
  • A single visual interaction substantially raises task success under out-of-domain visual shifts and spatial ambiguities.
  • Enables recovery from failures in embodied control by aligning human guidance with model reasoning.
  • Couples the reasoning module with a lightweight reactive action head for efficient execution (see the action-head sketch after this list).
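Figure 2 describes this head as a Flow-Matching module driven by the latent reasoning states. The following is a generic conditional flow-matching sketch in PyTorch; dimensions, layer layout, and names are assumptions for illustration, not the authors' architecture.

    # Generic flow-matching action head conditioned on latent reasoning states.
    import torch
    import torch.nn as nn

    class FlowMatchingActionHead(nn.Module):
        def __init__(self, action_dim=7, horizon=8, reasoning_dim=1024, hidden=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(action_dim * horizon + reasoning_dim + 1, hidden),
                nn.SiLU(),
                nn.Linear(hidden, action_dim * horizon),
            )
            self.horizon, self.action_dim = horizon, action_dim

        def velocity(self, noisy_actions, t, h_reasoning):
            # Predict the velocity field for the noisy action chunk at time t.
            x = torch.cat([noisy_actions.flatten(1), h_reasoning, t], dim=-1)
            return self.net(x).view(-1, self.horizon, self.action_dim)

    def flow_matching_loss(head, actions, h_reasoning):
        # Rectified-flow style objective: interpolate noise -> data and regress
        # the constant velocity (data - noise).
        noise = torch.randn_like(actions)
        t = torch.rand(actions.shape[0], 1, device=actions.device)
        a_t = (1 - t).unsqueeze(-1) * noise + t.unsqueeze(-1) * actions
        target_v = actions - noise
        pred_v = head.velocity(a_t, t, h_reasoning)
        return ((pred_v - target_v) ** 2).mean()

At inference such a head would integrate the learned velocity field from noise to an action chunk in a few Euler steps, which is what would keep execution lightweight relative to autoregressive action decoding.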

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same conditioning mechanism could support multi-turn guidance for longer task sequences without requiring new training data.
  • This style of explicit visual steering may transfer to other domains that already use human oversight, such as teleoperated or assistive systems.
  • Combining the guidance input with existing correction techniques could further reduce the need for full policy retraining after deployment.

Load-bearing premise

Users will supply accurate, task-relevant spatial priors that the model can integrate without creating new errors or ambiguities.

What would settle it

An experiment that supplies deliberately inaccurate or ambiguous spatial cues and measures whether success rates fall below non-interactive baselines under the same out-of-domain visual shifts.
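A sketch of that protocol, assuming placeholder env, policy, and ground-truth guidance interfaces (none of which are specified in the paper): corrupt the user cues with pixel-space Gaussian noise, then compare the guided policy against both accurate guidance and the non-interactive baseline on the same out-of-domain episodes.

    # Hypothetical settling experiment; env, policy, and accurate_priors are placeholders.
    import numpy as np

    def corrupt_priors(priors, sigma_px, width, height, rng):
        """priors: list of (kind, coords) pairs, coords being (x, y) pixel tuples."""
        noisy = []
        for kind, coords in priors:
            pts = np.asarray(coords, dtype=float)
            pts += rng.normal(0.0, sigma_px, size=pts.shape)
            pts[:, 0] = pts[:, 0].clip(0, width - 1)
            pts[:, 1] = pts[:, 1].clip(0, height - 1)
            noisy.append((kind, [tuple(p) for p in pts]))
        return noisy

    def success_rate(env, policy, episodes, guidance_fn=None, trials=50):
        wins = total = 0
        for ep in episodes:
            for _ in range(trials):
                priors = guidance_fn(ep) if guidance_fn else None
                wins += int(env.rollout(policy, ep, priors).succeeded)
                total += 1
        return wins / total

    def settling_experiment(env, policy, ood_episodes, accurate_priors, sigma_px=20):
        rng = np.random.default_rng(0)
        bad = lambda ep: corrupt_priors(accurate_priors(ep), sigma_px, 640, 480, rng)
        return {
            "non_interactive": success_rate(env, policy, ood_episodes),
            "accurate_guidance": success_rate(env, policy, ood_episodes, accurate_priors),
            "corrupted_guidance": success_rate(env, policy, ood_episodes, bad),
        }

The load-bearing premise is in trouble if corrupted_guidance lands below non_interactive at noise levels a real user could plausibly produce.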

Figures

Figures reproduced from arXiv: 2605.13632 by Chuanxiu Liu, Jie Liu, Jinghang Li, Lei Zhang, Qing Jiang, Qing Lian, Tianming Zhang, Xiaoke Jiang, Yiran Ling.

Figure 1. Conventional direct VLA policies can fail under spatial ambiguity or imprecise grounding, since they …
Figure 2. Overview of GTA-VLA (Guide, Think, Act). The framework consists of three stages. Guide: the model receives the primary image, the language instruction, and optional spatial priors (e.g., affordance points, boxes, or traces). Think: the VLM backbone generates a conditioned spatial-visual reasoning sequence and the corresponding latent reasoning states H_reasoning. Act: a downstream Flow-Matching action head …
Figure 3. Interact-306K and automatic instruction annotation. Left: dataset composition: 306K episodes collected from six manipulation sources (e.g., Bridge [30], Fractal [38], Droid [16], and RoboMind variants [32]). Right: automatic annotation pipeline: keyframe extraction and task decomposition from trajectories, followed by open-vocabulary grounding and tracking to produce structured subtask instructions with …
Figure 4. Real-world robot deployment. Left: the experimental setup with the Agile Piper robot, a primary …
Figure 5. Simpler WidowX Base Benchmark.
Figure 6. Simpler Google Robot Base Benchmark.
Figure 7. Visualization of real-time CoT output results during operation.
Figure 8. Visualization for guidance efficiency evaluation.
Figure 9. Visualization of visual shift and object shift in the Simpler Plus benchmark.
Original abstract

In this paper, we propose GTA-VLA(Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: https://signalispupupu.github.io/GTA-VLA_ProjPage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes GTA-VLA, an interactive Vision-Language-Action framework that augments existing VLA models with optional user-provided spatial priors (affordance points, boxes, traces) to produce a unified spatial-visual Chain-of-Thought, followed by a lightweight reactive action head. It reports a state-of-the-art 81.2% success rate on the in-domain SimplerEnv WidowX benchmark and substantial gains under OOD visual shifts and spatial ambiguities from a single visual interaction, positioning the approach as a way to improve failure recovery and robustness beyond direct Sense-to-Act mappings or standard embodied CoT.

Significance. If the performance claims hold under rigorous verification, the work offers a practical mechanism for human spatial guidance in embodied agents, addressing a clear limitation in current VLA brittleness to distribution shifts. The empirical focus on interactive correction rather than purely autonomous reasoning could influence future designs for deployable robotics systems, provided the integration of external priors proves reliable.

major comments (3)
  1. [Experimental results] Experimental results section: the headline 81.2% in-domain success rate and OOD gains are presented without reported statistical significance, trial counts, variance, or exact baseline implementations and hyper-parameters, rendering the SOTA claim only partially verifiable from the text.
  2. [Method and experiments] Framework description and experiments: the central assumption that user spatial priors integrate cleanly without introducing new error modes lacks supporting ablations on input noise, sensitivity analysis, or quantitative comparison of guided vs. unguided failure cases, which directly bears on whether the reported OOD improvements are robust.
  3. [Evaluation] Evaluation protocol: no failure-case analysis or breakdown of how the unified spatial-visual CoT resolves (or fails to resolve) specific ambiguities is provided, leaving the mechanism for interactive recovery underspecified relative to the strength of the claims.
minor comments (1)
  1. [Method] The project page link is given but the manuscript would benefit from a brief self-contained description of the exact spatial prior encoding scheme (e.g., how points/boxes/traces are tokenized and fused).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and will revise the manuscript to incorporate the suggested improvements where appropriate.

Point-by-point responses
  1. Referee: [Experimental results] Experimental results section: the headline 81.2% in-domain success rate and OOD gains are presented without reported statistical significance, trial counts, variance, or exact baseline implementations and hyper-parameters, rendering the SOTA claim only partially verifiable from the text.

    Authors: We agree that the current presentation leaves the SOTA claim only partially verifiable. Each task was evaluated over 50 independent trials; we will add the exact trial counts, standard deviations (approximately ±2.8% around the 81.2% mean), and a supplementary table that documents the precise baseline implementations together with the hyper-parameters taken from the original papers. These details will be inserted into the Experimental Results section. revision: yes
    (A quick check of the quoted ±2.8% spread is sketched after these responses.)

  2. Referee: [Method and experiments] Framework description and experiments: the central assumption that user spatial priors integrate cleanly without introducing new error modes lacks supporting ablations on input noise, sensitivity analysis, or quantitative comparison of guided vs. unguided failure cases, which directly bears on whether the reported OOD improvements are robust.

    Authors: We acknowledge that explicit validation of robustness to noisy priors is missing. In the revised manuscript we will add an ablation study that injects controlled Gaussian noise into affordance points and boxes, report the resulting performance curves, and include a quantitative comparison of failure rates between guided and unguided runs under the same OOD visual shifts. These results will appear in a new subsection of the Experiments section. revision: yes

  3. Referee: [Evaluation] Evaluation protocol: no failure-case analysis or breakdown of how the unified spatial-visual CoT resolves (or fails to resolve) specific ambiguities is provided, leaving the mechanism for interactive recovery underspecified relative to the strength of the claims.

    Authors: We will expand the evaluation protocol with a dedicated failure-case analysis subsection. It will contain both qualitative examples illustrating how the spatial-visual CoT resolves particular ambiguities (e.g., occlusion or multi-object confusion) and quantitative success-rate breakdowns stratified by ambiguity type. This addition will make the interactive-recovery mechanism explicit. revision: yes
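On the numbers in response 1: with a mean success rate p averaged over k tasks of n Bernoulli trials each, the standard error of the mean is sqrt(p(1-p)/n)/sqrt(k). Assuming the four tasks of the SimplerEnv WidowX suite at the 50 trials per task quoted above (the task count is an assumption, not stated in the rebuttal), an 81.2% mean gives roughly ±2.8 percentage points, consistent with the figure the authors cite.

    # Back-of-envelope check of the reported spread, treating each task as
    # having roughly the same underlying success rate p.
    from math import sqrt

    def se_of_mean_success(p, trials_per_task, num_tasks):
        return sqrt(p * (1.0 - p) / trials_per_task) / sqrt(num_tasks)

    print(se_of_mean_success(0.812, 50, 4))  # ~0.0276, i.e. about +/-2.8 points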

Circularity Check

0 steps flagged

No circularity: an empirical VLA architecture built from existing components

full rationale

The paper presents GTA-VLA as an empirical framework extending existing VLA models and embodied CoT reasoning by incorporating optional user spatial priors (points, boxes, traces) into a unified spatial-visual reasoning process, followed by a reactive action head. Reported results consist of benchmark success rates (e.g., 81.2% on SimplerEnv WidowX) obtained through experiments, with no equations, derivations, or parameter fits that reduce any claimed prediction or result to the same inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the work is self-contained as an architectural proposal validated externally on standard benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new physical entities or unproven mathematical axioms. It relies on standard transformer-based VLA and CoT assumptions already present in the cited prior work.

axioms (1)
  • domain assumption Pre-trained VLA models and CoT reasoning modules can be extended with additional visual conditioning inputs without loss of core capabilities.
    The framework assumes existing VLA backbones remain effective when augmented with user spatial cues.

pith-pipeline@v0.9.0 · 5622 in / 1223 out tokens · 38767 ms · 2026-05-14T18:27:09.637078+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 16 internal anchors

  1. [1] Bai, S., Cai, Y., Chen, R., et al.: Qwen3-VL technical report.
  2. [2] Belkhale, S., Ding, T., Xiao, T., Sermanet, P., Vuong, Q., Tompson, J., Chebotar, Y., Dwibedi, D., Sadigh, D.: RT-H: Action hierarchies using language. arXiv preprint arXiv:2403.01823 (2024)
  3. [3] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)
  4. [4] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)
  5. [5] Carion, N., Gustafson, L., Hu, Y.T., et al.: SAM 3: Segment anything with concepts.
  6. [6] Cheang, C., Chen, S., Cui, Z., Hu, Y., Huang, L., Kong, T., Li, H., Li, Y., Liu, Y., Ma, X., Niu, H., Ou, W., Peng, W., Ren, Z., Shi, H., Tian, J., Wu, H., Xiao, X., Xiao, Y., Xu, J., Yang, Y.: GR-3 technical report. arXiv preprint arXiv:2507.15493 (2025)
  7. [7] Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)
  8. [8] Chen, X., Chen, Y., Fu, Y., et al.: InternVLA-M1: A spatially guided vision-language-action framework for generalist robot policy.
  9. [9] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)
  10. [10] Deng, S., Yan, M., Wei, S., Ma, H., Yang, Y., Chen, J., Zhang, Z., Yang, T., Zhang, X., Cui, H., et al.: GraspVLA: A grasping foundation model pre-trained on billion-scale synthetic action data. In: CoRL (2025)
  11. [11] Gu, J., Kirmani, S., Wohlhart, P., Lu, Y., Arenas, M.G., Rao, K., Yu, W., Fu, C., Gopalakrishnan, K., Xu, Z., et al.: RT-Trajectory: Robotic task generalization via hindsight trajectory sketches. In: ICLR (2023)
  12. [12] Huang, C.P., Man, Y., Yu, Z., Chen, M.H., Kautz, J., Wang, Y.C.F., Yang, F.E.: Fast-ThinkAct: Efficient vision-language-action reasoning via verbalizable latent planning. arXiv preprint arXiv:2601.09708 (2026)
  13. [13] Huang, C.P., Wu, Y.H., Chen, M.H., Wang, Y.C.F., Yang, F.E.: ThinkAct: Vision-language-action reasoning via reinforced visual latent planning. In: NeurIPS (2025)
  14. [14] Jiang, Q., Huo, J., Chen, X., Xiong, Y., Zeng, Z., Chen, Y., Ren, T., Yu, J., Zhang, L.: Detect anything via next point prediction. arXiv preprint arXiv:2510.12798 (2025)
  15. [15] Jiang, Q., Li, F., Zeng, Z., Ren, T., Liu, S., Zhang, L.: T-Rex2: Towards generic object detection via text-visual prompt synergy. In: ECCV (2024)
  16. [16] Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: DROID: A large-scale in-the-wild robot manipulation dataset. In: Robotics: Science and Systems (2024)
  17. [17] Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)
  18. [18] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., et al.: OpenVLA: An open-source vision-language-action model. In: CoRL (2025)
  19. [19] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: CVPR (2023)
  20. [20] Lee, J., Duan, J., Fang, H., Deng, Y., Liu, S., Li, B., Fang, B., Zhang, J., Wang, Y.R., Lee, S., Han, W., Pumacay, W., Wu, A., Hendrix, R., Farley, K., VanderBilt, E., Farhadi, A., Fox, D., Krishna, R.: MolmoAct: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917 (2025)
  21. [21] Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang, Y., Wang, X., Liu, B., Fu, J., Bao, J., Chen, D., Shi, Y., Yang, J., Guo, B.: CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650 (2024)
  22. [22] Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., Levine, S., Wu, J., Finn, C., Su, H., Vuong, Q., Xiao, T.: Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941 (2024)
  23. [23] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: LIBERO: Benchmarking knowledge transfer for lifelong robot learning. NeurIPS 36, 44776–44791 (2023)
  24. [24] NVIDIA, Bjorck, J., Castañeda, F., et al.: GR00T N1: An open foundation model for generalist humanoid robots.
  25. [25] O'Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al.: Open X-Embodiment: Robotic learning datasets and RT-X models. Open X-Embodiment Collaboration.
  26. [26] In: IEEE International Conference on Robotics and Automation (2024)
  27. [27] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)
  28. [28] Tang, P., Xie, S., Sun, B., Huang, B., Luo, K., Yang, H., Jin, W., Wang, J.: Mind to hand: Purposeful robotic control via embodied reasoning. arXiv preprint arXiv:2512.08580 (2025)
  29. [29] Team, B.S.: Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062 (2025)
  30. [30] Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., Luo, J., Tan, Y.L., Chen, L.Y., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., Levine, S.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)
  31. [31] Walke, H., Black, K., Lee, A., Kim, M.J., Du, M., Zheng, C., Zhao, T., Hansen-Estruch, P., Vuong, Q., He, A., Myers, V., Fang, K., Finn, C., Levine, S.: BridgeData V2: A dataset for robot learning at scale. In: CoRL (2023)
  32. [32] Wang, Y., Li, X., Wang, W., Zhang, J., Li, Y., Chen, Y., Wang, X., Zhang, Z.: Unified vision-language-action model. arXiv preprint arXiv:2506.19850 (2025)
  33. [33] Wu, K., Hou, C., Liu, J., Che, Z., Ju, X., Yang, Z., Li, M., Zhao, Y., Xu, Z., Yang, G., et al.: RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation. In: Robotics: Science and Systems (2025)
  34. [34] Yang, J., Zhang, H., Li, F., Zou, X., Li, C.Y., Gao, J.: Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023)
  35. [35] You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S.F., Yang, Y.: Ferret: Refer and ground anything anywhere at any granularity. In: ICLR (2023)
  36. [36] Zawalski, M., Chen, W., Pertsch, K., Mees, O., Finn, C., Levine, S.: Robotic control via embodied chain-of-thought reasoning. In: CoRL, pp. 3157–3181. PMLR (2025)
  37. [37] Zhao, Q., Lu, Y., Kim, M.J., Fu, Z., Zhang, Z., Wu, Y., Li, Z., Ma, Q., Han, S., Finn, C., et al.: CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. In: CVPR, pp. 1702–1713 (2025)
  38. [38] Zheng, J., Li, J., Wang, Z., Liu, D., Kang, X., Feng, Y., Zheng, Y., Zou, J., Chen, Y., Zeng, J., Zhang, Y.Q., Pang, J., Liu, J., Wang, T., Zhan, X.: X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274 (2025)
  39. [39] Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. In: CoRL (2023)