pith. machine review for the scientific record.

arxiv: 2502.19417 · v2 · submitted 2025-02-26 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 1 theorem link

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 22:49 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.LG
keywords hierarchical models · vision-language-action · open-ended instruction following · situated feedback · generalist robots · robotic manipulation · multi-platform evaluation

The pith

A hierarchical vision-language model lets robots interpret complex instructions and real-time feedback to choose and perform next steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a two-level system in which a vision-language model first reasons over open-ended natural-language commands and situated user feedback to decide the appropriate next action, after which a low-level policy carries out the physical motion. This structure is shown to succeed on tasks such as preparing a vegetarian sandwich, clearing a table, and grocery shopping, where simple direct commands would fail. The approach is demonstrated on single-arm, dual-arm, and mobile dual-arm robots without task-specific reprogramming. A reader would care because it moves generalist robots closer to handling ambiguous human requests in everyday environments rather than requiring perfectly scripted inputs.

Core claim

The paper claims that separating high-level reasoning from low-level control through a vision-language model allows a robot to process intricate prompts and incorporate corrective feedback during execution, enabling it to complete multi-step tasks that direct instruction-following methods cannot handle.

What carries the argument

Hierarchical vision-language-action model: a high-level VLM maps language and visual feedback to the next sub-goal, while a separate low-level policy translates that sub-goal into robot actions.
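
A minimal sketch of this two-level loop, assuming a slow high-level reasoner and a fast low-level policy. All class names, method signatures, and the replanning cadence below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: hierarchical VLM planner over a low-level policy.
# Interfaces (HighLevelVLM, LowLevelPolicy, robot.*) are assumptions for illustration.

class HighLevelVLM:
    """Maps images, the open-ended instruction, and any user feedback to a short sub-goal string."""
    def next_subgoal(self, image, instruction: str, feedback: str | None) -> str:
        raise NotImplementedError  # e.g. "pick up the bread slice", "put the chips back"

class LowLevelPolicy:
    """Maps the current observation plus a sub-goal string to a robot action (chunk)."""
    def act(self, image, proprio, subgoal: str):
        raise NotImplementedError

def run_episode(robot, vlm: HighLevelVLM, policy: LowLevelPolicy,
                instruction: str, max_steps: int = 1000, replan_every: int = 50):
    """Query the slow VLM only on a coarse schedule or when feedback arrives;
    run the fast low-level policy at every control step."""
    subgoal = None
    for step in range(max_steps):
        obs = robot.observe()                   # images + proprioception (assumed interface)
        feedback = robot.poll_user_feedback()   # e.g. "that's not trash", or None
        if subgoal is None or feedback or step % replan_every == 0:
            subgoal = vlm.next_subgoal(obs.image, instruction, feedback)
        robot.step(policy.act(obs.image, obs.proprio, subgoal))
```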

If this is right

  • Robots can now respond to verbal corrections mid-task instead of requiring all instructions upfront.
  • The same high-level model can be reused across single-arm, dual-arm, and mobile platforms with only the low-level policy swapped.
  • Tasks that combine object manipulation with semantic understanding, such as distinguishing trash from food, become executable without custom code for each scenario.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The architecture may reduce the amount of robot-specific demonstration data needed for new tasks by leveraging the pre-trained vision-language model's reasoning.
  • Extending the high-level layer to longer-horizon planning could allow robots to generate entire task sequences from a single high-level goal.
  • Real-time feedback integration opens the possibility of safer shared workspaces where humans can verbally redirect the robot without physical intervention.

Load-bearing premise

The high-level vision-language model reliably converts open-ended instructions and visual feedback into correct next-step decisions without misinterpreting context or inventing invalid actions.

What would settle it

Run the robot on a table-cleaning task with an item the user labels 'that's not trash' and observe whether it correctly avoids removing that item while still clearing the rest of the table.
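
A minimal sketch of how that settling test could be scored, assuming a hypothetical episode harness that accepts scripted user corrections and reports what remains on the table afterwards.

```python
# Hypothetical test harness: does the robot keep the item the user corrected on,
# while still clearing everything else? `run_episode_fn` and `items_on_table`
# are assumed interfaces, not part of the paper.

def feedback_respect_test(run_episode_fn, scene_items, keep_item: str) -> bool:
    corrections = {keep_item: "that's not trash"}           # scripted mid-task feedback
    final_state = run_episode_fn(
        instruction="clean up the table",
        items=scene_items,
        user_corrections=corrections,
    )
    remaining = final_state.items_on_table()
    return keep_item in remaining and len(remaining) == 1   # only the corrected item stays

# Example: clear the wrapper and the can, but keep the phone after the correction.
# passed = feedback_respect_test(run_episode, ["wrapper", "soda can", "phone"], keep_item="phone")
```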

read the original abstract

Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. Videos are available at https://www.pi.website/research/hirobot

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents Hi Robot, a hierarchical vision-language-action architecture in which a high-level VLM first interprets open-ended natural-language instructions and situated visual feedback to select the next appropriate step, after which low-level action models execute the chosen primitive. The system is demonstrated qualitatively across three robot platforms (single-arm, dual-arm, and mobile dual-arm) on tasks such as table cleaning, sandwich assembly, and grocery shopping, with emphasis on its ability to incorporate corrective feedback such as “that’s not trash.”

Significance. A reliably working hierarchical decomposition could advance generalist robotics by letting robots handle nuanced, context-dependent instructions that direct VLM prompting struggles with. The multi-platform evaluation suggests some degree of transferability, yet the complete absence of quantitative metrics prevents any assessment of how large or consistent the claimed advantage actually is.

major comments (2)
  1. [Evaluation] Evaluation section: the central claim that the high-level VLM “can reason through complex prompts and incorporate situated feedback” rests entirely on qualitative video demonstrations; no success rates, error rates, confusion matrices, or controlled tests of feedback incorporation (e.g., accuracy on prompts containing “that’s not trash”) are reported, leaving the robustness assumption unmeasured.
  2. [Evaluation] Evaluation section: no baseline comparison to direct (non-hierarchical) VLM instruction following is provided, so the asserted superiority of the hierarchical structure cannot be quantified or even verified against the simpler alternative the paper contrasts with.
minor comments (2)
  1. [Implementation details] The manuscript should state the exact VLM checkpoints and prompting templates used for the high-level reasoner so that the qualitative results can be reproduced.
  2. [Figures and videos] Figure captions and video descriptions should explicitly link each clip to the specific feedback-handling behavior being illustrated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of hierarchical decomposition for handling nuanced instructions and feedback. We agree that stronger quantitative evaluation is needed and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central claim that the high-level VLM “can reason through complex prompts and incorporate situated feedback” rests entirely on qualitative video demonstrations; no success rates, error rates, confusion matrices, or controlled tests of feedback incorporation (e.g., accuracy on prompts containing “that’s not trash”) are reported, leaving the robustness assumption unmeasured.

    Authors: We agree that the current evaluation is primarily qualitative. In the revised manuscript we will add quantitative success rates obtained from repeated trials on the demonstrated tasks (table cleaning, sandwich assembly, grocery shopping) across the three platforms. We will also include a controlled test measuring the high-level VLM’s accuracy in correctly updating the plan when given corrective feedback phrases such as “that’s not trash.” revision: yes

  2. Referee: [Evaluation] Evaluation section: no baseline comparison to direct (non-hierarchical) VLM instruction following is provided, so the asserted superiority of the hierarchical structure cannot be quantified or even verified against the simpler alternative the paper contrasts with.

    Authors: We acknowledge the absence of a direct baseline comparison. We will add experiments that run the identical tasks using direct (non-hierarchical) VLM prompting and report comparative success rates, thereby quantifying the advantage of the hierarchical separation of high-level reasoning from low-level control. revision: yes
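
A minimal sketch of the comparison promised above: repeated trials of the same tasks under the hierarchical system and a direct (flat) VLM baseline, tallied as per-task success rates. The `run_trial` harness and the method and task names are assumptions for illustration.

```python
# Hypothetical tally of success rates for hierarchical vs. direct instruction following.

from collections import defaultdict

def success_rates(run_trial, methods, tasks, n_trials: int = 20):
    """run_trial(method, task, seed) -> bool; returns {method: {task: success_rate}}."""
    rates = defaultdict(dict)
    for method in methods:        # e.g. ["hi_robot_hierarchical", "direct_vlm_baseline"]
        for task in tasks:        # e.g. ["table_cleaning", "sandwich", "grocery_shopping"]
            wins = sum(run_trial(method, task, seed=i) for i in range(n_trials))
            rates[method][task] = wins / n_trials
    return rates
```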

Circularity Check

0 steps flagged

No circularity: empirical system description without derivation chain

full rationale

The paper describes a hierarchical VLM-based robotic control system for open-ended instructions and feedback, evaluated via qualitative demonstrations on three platforms for tasks such as sandwich-making and grocery shopping. No equations, fitted parameters, uniqueness theorems, or self-citations that reduce claims to inputs appear in the provided text. The central claims rest on described experiments rather than any mathematical reduction or self-referential construction, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes pre-trained VLMs possess sufficient grounded reasoning to map language and images to valid next steps without additional training or verification mechanisms described.

axioms (1)
  • domain assumption Pre-trained vision-language models can accurately deduce appropriate next physical steps from complex natural-language instructions and visual feedback.
    Invoked in the description of the high-level reasoning layer.

pith-pipeline@v0.9.0 · 5565 in / 1097 out tokens · 50966 ms · 2026-05-15T22:49:39.001717+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  2. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  3. QuadAgent: A Responsive Agent System for Vision-Language Guided Quadrotor Agile Flight

    cs.RO 2026-04 unverdicted novelty 7.0

    QuadAgent uses an asynchronous multi-agent architecture with an Impression Graph for scene memory and vision-based avoidance to enable training-free vision-language guided agile quadrotor flight, outperforming baselin...

  4. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  5. UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

    cs.RO 2026-02 unverdicted novelty 7.0

    UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

  6. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  7. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  8. G-Zero: Self-Play for Open-Ended Generation from Zero Data

    cs.LG 2026-05 unverdicted novelty 6.0

    G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.

  9. SEIF: Self-Evolving Reinforcement Learning for Instruction Following

    cs.CL 2026-05 conditional novelty 6.0

    SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

  10. AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

    cs.RO 2026-04 unverdicted novelty 6.0

    AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.

  11. ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions

    cs.RO 2026-04 unverdicted novelty 6.0

    ExpressMM integrates high-level language-guided planning with low-level vision-language-action policies to enable expressive and interruptible mobile manipulation behaviors in human-robot collaboration, shown effectiv...

  13. ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

    cs.RO 2026-03 unverdicted novelty 6.0

    ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.

  13. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  14. Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

    cs.RO 2026-02 unverdicted novelty 6.0

    Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and ge...

  15. $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    cs.LG 2025-11 unverdicted novelty 6.0

    RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

  16. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  17. Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    cs.RO 2025-10 unverdicted novelty 6.0

    A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.

  18. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  19. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  20. RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

    cs.RO 2026-04 unverdicted novelty 5.0

    RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 19 Pith papers · 13 internal anchors

  1. [1]

    Rt-h: Action hierarchies using language

    Belkhale, S., Ding, T., Xiao, T., Sermanet, P., Vuong, Q., Tompson, J., Chebotar, Y., Dwibedi, D., and Sadigh, D. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024

  2. [2]

    PaliGemma: A versatile 3B VLM for transfer

    Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. π₀: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  4. [4]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  5. [5]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023 a

  6. [6]

    Do as i can, not as i say: Grounding language in robotic affordances

    Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on robot learning, pp.\ 287--318. PMLR, 2023 b

  7. [7]

    Automating robot failure recovery using vision-language models with optimized prompts

    Chen, H., Yao, Y., Liu, R., Liu, C., and Ichnowski, J. Automating robot failure recovery using vision-language models with optimized prompts. arXiv preprint arXiv:2409.03966, 2024

  8. [8]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

  9. [9]

    Racer: Rich language-guided failure recovery policies for imitation learning

    Dai, Y., Lee, J., Fazeli, N., and Chai, J. Racer: Rich language-guided failure recovery policies for imitation learning. arXiv preprint arXiv:2409.14674, 2024

  10. [10]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  11. [11]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Fu, Z., Zhao, T. Z., and Finn, C. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

  12. [12]

    Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning, 2023

    Hu, Y., Lin, F., Zhang, T., Yi, L., and Gao, Y. Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning, 2023. URL https://arxiv.org/abs/2311.17842

  13. [13]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning, pp.\ 9118--9147. PMLR, 2022

  14. [14]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., and Fei-Fei, L. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023

  15. [15]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., and Finn, C. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pp.\ 991--1002. PMLR, 2022

  16. [16]

    Thinking, fast and slow

    Kahneman, D. Thinking, fast and slow. Farrar, Straus and Giroux, New York, 2011. ISBN 9780374275631 0374275637

  17. [17]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  18. [18]

    Interactive task planning with language models, 2025 a

    Li, B., Wu, P., Abbeel, P., and Malik, J. Interactive task planning with language models, 2025 a . URL https://arxiv.org/abs/2310.10645

  19. [19]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang, Y., et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  20. [20]

    Hamster: Hierarchical action models for open-world robot manipulation

    Li, Y., Deng, Y., Zhang, J., Jang, J., Memmel, M., Yu, R., Garrett, C. R., Ramos, F., Fox, D., Li, A., Gupta, A., and Goyal, A. Hamster: Hierarchical action models for open-world robot manipulation, 2025 b . URL https://arxiv.org/abs/2502.05485

  21. [21]

    Code as policies: Language model programs for embodied control

    Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 9493--9500. IEEE, 2023

  22. [22]

    Moka: Open-vocabulary robotic manipulation through mark-based visual prompting

    Liu, F., Fang, K., Abbeel, P., and Levine, S. Moka: Open-vocabulary robotic manipulation through mark-based visual prompting. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024 a

  23. [23]

    Interactive robot learning from verbal correction

    Liu, H., Chen, A., Zhu, Y., Swaminathan, A., Kolobov, A., and Cheng, C.-A. Interactive robot learning from verbal correction. arXiv preprint arXiv:2310.17555, 2023

  24. [24]

    Ok-robot: What really matters in integrating open-knowledge models for robotics

    Liu, P., Orru, Y., Vakil, J., Paxton, C., Shafiullah, N. M. M., and Pinto, L. Ok-robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024 b

  25. [25]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., and Zhu, J. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024 c

  26. [26]

    Decoupled weight decay regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2017

  27. [27]

    Learning to parse natural language commands to a robot control system

    Matuszek, C., Herbst, E., Zettlemoyer, L., and Fox, D. Learning to parse natural language commands to a robot control system. In Experimental Robotics: The 13th International Symposium on Experimental Robotics, volume 88, pp.\ 403. Springer, 2013

  28. [28]

    Is feedback all you need? leveraging natural language feedback in goal-conditioned rl

    McCallum, S., Taylor-Davies, M., Albrecht, S., and Suglia, A. Is feedback all you need? leveraging natural language feedback in goal-conditioned rl. In NeurIPS 2023 Workshop on Goal-Conditioned Reinforcement Learning

  29. [29]

    Learning neuro-symbolic programs for language guided robot manipulation

    Namasivayam, K., Singh, H., Bindal, V., Tuli, A., Agrawal, V., Jain, R., Singla, P., and Paul, R. Learning neuro-symbolic programs for language guided robot manipulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 7973--7980. IEEE, 2023

  30. [30]

    Pivot: Iterative visual prompting elicits actionable knowledge for vlms

    Nasiriany, S., Xia, F., Yu, W., Xiao, T., Liang, J., Dasgupta, I., Xie, A., Driess, D., Wahid, A., Xu, Z., et al. Pivot: Iterative visual prompting elicits actionable knowledge for vlms. arXiv preprint arXiv:2402.07872, 2024

  31. [31]

    Octo: An open-source generalist robot policy

    Octo Model Team , Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Xu, C., Luo, J., Kreiman, T., Tan, Y., Chen, L. Y., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., and Levine, S. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

  32. [32]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 6892--6903. IEEE, 2024

  33. [33]

    Inferring compact representations for efficient natural language understanding of robot instructions

    Patki, S., Daniele, A. F., Walter, M. R., and Howard, T. M. Inferring compact representations for efficient natural language understanding of robot instructions. In 2019 International Conference on Robotics and Automation (ICRA), pp.\ 6926--6933. IEEE, 2019

  34. [34]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  35. [35]

    Open-vocabulary mobile manipulation in unseen dynamic environments with 3d semantic maps

    Qiu, D., Ma, W., Pan, Z., Xiong, H., and Liang, J. Open-vocabulary mobile manipulation in unseen dynamic environments with 3d semantic maps. arXiv preprint arXiv:2406.18115, 2024

  36. [36]

    Robust speech recognition via large-scale weak supervision

    Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pp.\ 28492--28518. PMLR, 2023

  37. [37]

    Bumble: Unifying reasoning and acting with vision-language models for building-wide mobile manipulation

    Shah, R., Yu, A., Zhu, Y., Zhu, Y., and Martín-Martín, R. Bumble: Unifying reasoning and acting with vision-language models for building-wide mobile manipulation. arXiv preprint arXiv:2410.06237, 2024

  38. [38]

    Yell at your robot: Improving on-the-fly from language corrections

    Shi, L. X., Hu, Z., Zhao, T. Z., Sharma, A., Pertsch, K., Luo, J., Levine, S., and Finn, C. Yell at your robot: Improving on-the-fly from language corrections. arXiv preprint arXiv:2403.12910, 2024

  39. [39]

    Progprompt: Generating situated robot task plans using large language models

    Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 11523--11530. IEEE, 2023

  40. [40]

    Lgr2: Language guided reward relabeling for accelerating hierarchical reinforcement learning

    Singh, U., Bhattacharyya, P., and Namboodiri, V. P. Lgr2: Language guided reward relabeling for accelerating hierarchical reinforcement learning. arXiv preprint arXiv:2406.05881, 2024

  41. [41]

    Rlvf: Learning from verbal feedback without overgeneralization

    Stephan, M., Khazatsky, A., Mitchell, E., Chen, A. S., Hsu, S., Sharma, A., and Finn, C. Rlvf: Learning from verbal feedback without overgeneralization. arXiv preprint arXiv:2402.10893, 2024

  42. [42]

    Language-conditioned imitation learning for robot manipulation tasks

    Stepputtis, S., Campbell, J., Phielipp, M., Lee, S., Baral, C., and Ben Amor, H. Language-conditioned imitation learning for robot manipulation tasks. Advances in Neural Information Processing Systems, 33:13139--13150, 2020

  43. [43]

    Open-world object manipulation using pre-trained vision-language models

    Stone, A., Xiao, T., Lu, Y., Gopalakrishnan, K., Lee, K.-H., Vuong, Q., Wohlhart, P., Kirmani, S., Zitkovich, B., Xia, F., et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023

  44. [44]

    A computational model for the alignment of hierarchical scene representations in human-robot interaction

    Swadzba, A., Vorwerg, C., Wachsmuth, S., and Rickheit, G. A computational model for the alignment of hierarchical scene representations in human-robot interaction. In Twenty-First International Joint Conference on Artificial Intelligence. Citeseer, 2009

  45. [45]

    LLM^3: Large language model-based task and motion planning with motion failure reasoning

    Wang, S., Han, M., Jiao, Z., Zhang, Z., Wu, Y. N., Zhu, S.-C., and Liu, H. LLM^3: Large language model-based task and motion planning with motion failure reasoning. arXiv preprint arXiv:2403.11552, 2024

  46. [46]

    Tinyvla: Towards fast, data-efficient vision- language-action models for robotic manipulation

    Wen, J., Zhu, Y., Li, J., Zhu, M., Wu, K., Xu, Z., Liu, N., Cheng, R., Shen, C., Peng, Y., et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024

  47. [47]

    Robi butler: Remote multimodal interactions with household robot assistant

    Xiao, A., Janaka, N., Hu, T., Gupta, A., Li, K., Yu, C., and Hsu, D. Robi butler: Remote multimodal interactions with household robot assistant. arXiv preprint arXiv:2409.20548, 2024

  48. [48]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Zawalski, M., Chen, W., Pertsch, K., Mees, O., Finn, C., and Levine, S. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

  49. [49]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  50. [50]

    Universal actions for enhanced embodied foundation models

    Zheng, J., Li, J., Liu, D., Zheng, Y., Wang, Z., Ou, Z., Liu, Y., Liu, J., Zhang, Y.-Q., and Zhan, X. Universal actions for enhanced embodied foundation models. arXiv preprint arXiv:2501.10105, 2025

  51. [51]

    Closed-loop open-vocabulary mobile manipulation with gpt-4v

    Zhi, P., Zhang, Z., Han, M., Zhang, Z., Li, Z., Jiao, Z., Jia, B., and Huang, S. Closed-loop open-vocabulary mobile manipulation with gpt-4v. arXiv preprint arXiv:2404.10220, 2024