pith. machine review for the scientific record.

arxiv: 2502.19417 · v2 · submitted 2025-02-26 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 1 theorem link

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 22:49 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.LG
keywords hierarchical models · vision-language-action · open-ended instruction following · situated feedback · generalist robots · robotic manipulation · multi-platform evaluation

The pith

A hierarchical vision-language model lets robots interpret complex instructions and real-time feedback to choose and perform next steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a two-level system in which a vision-language model first reasons over open-ended natural-language commands and situated user feedback to decide the appropriate next action, after which a low-level policy carries out the physical motion. This structure is shown to succeed on tasks such as preparing a vegetarian sandwich, clearing a table, and grocery shopping, where simple direct commands would fail. The approach is demonstrated on single-arm, dual-arm, and mobile dual-arm robots without task-specific reprogramming. A reader would care because it moves generalist robots closer to handling ambiguous human requests in everyday environments rather than requiring perfectly scripted inputs.

Core claim

The paper claims that separating high-level reasoning from low-level control through a vision-language model allows a robot to process intricate prompts and incorporate corrective feedback during execution, enabling it to complete multi-step tasks that direct instruction-following methods cannot handle.

What carries the argument

Hierarchical vision-language-action model: a high-level VLM maps language and visual feedback to the next sub-goal, while a separate low-level policy translates that sub-goal into robot actions.
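
A minimal sketch of this two-level loop, assuming a slow high-level reasoner and a fast low-level policy. All class names, method signatures, and the replanning cadence below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: hierarchical VLM planner over a low-level policy.
# Interfaces (HighLevelVLM, LowLevelPolicy, robot.*) are assumptions for illustration.

class HighLevelVLM:
    """Maps images, the open-ended instruction, and any user feedback to a short sub-goal string."""
    def next_subgoal(self, image, instruction: str, feedback: str | None) -> str:
        raise NotImplementedError  # e.g. "pick up the bread slice", "put the chips back"

class LowLevelPolicy:
    """Maps the current observation plus a sub-goal string to a robot action (chunk)."""
    def act(self, image, proprio, subgoal: str):
        raise NotImplementedError

def run_episode(robot, vlm: HighLevelVLM, policy: LowLevelPolicy,
                instruction: str, max_steps: int = 1000, replan_every: int = 50):
    """Query the slow VLM only on a coarse schedule or when feedback arrives;
    run the fast low-level policy at every control step."""
    subgoal = None
    for step in range(max_steps):
        obs = robot.observe()                   # images + proprioception (assumed interface)
        feedback = robot.poll_user_feedback()   # e.g. "that's not trash", or None
        if subgoal is None or feedback or step % replan_every == 0:
            subgoal = vlm.next_subgoal(obs.image, instruction, feedback)
        robot.step(policy.act(obs.image, obs.proprio, subgoal))
```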

If this is right

  • Robots can now respond to verbal corrections mid-task instead of requiring all instructions upfront.
  • The same high-level model can be reused across single-arm, dual-arm, and mobile platforms with only the low-level policy swapped.
  • Tasks that combine object manipulation with semantic understanding, such as distinguishing trash from food, become executable without custom code for each scenario.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The architecture may reduce the amount of robot-specific demonstration data needed for new tasks by leveraging the pre-trained vision-language model's reasoning.
  • Extending the high-level layer to longer-horizon planning could allow robots to generate entire task sequences from a single high-level goal.
  • Real-time feedback integration opens the possibility of safer shared workspaces where humans can verbally redirect the robot without physical intervention.

Load-bearing premise

The high-level vision-language model reliably converts open-ended instructions and visual feedback into correct next-step decisions without misinterpreting context or inventing invalid actions.

What would settle it

Run the robot on a table-cleaning task with an item the user labels 'that's not trash' and observe whether it correctly avoids removing that item while still clearing the rest of the table.
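
A minimal sketch of how that settling test could be scored, assuming a hypothetical episode harness that accepts scripted user corrections and reports what remains on the table afterwards.

```python
# Hypothetical test harness: does the robot keep the item the user corrected on,
# while still clearing everything else? `run_episode_fn` and `items_on_table`
# are assumed interfaces, not part of the paper.

def feedback_respect_test(run_episode_fn, scene_items, keep_item: str) -> bool:
    corrections = {keep_item: "that's not trash"}           # scripted mid-task feedback
    final_state = run_episode_fn(
        instruction="clean up the table",
        items=scene_items,
        user_corrections=corrections,
    )
    remaining = final_state.items_on_table()
    return keep_item in remaining and len(remaining) == 1   # only the corrected item stays

# Example: clear the wrapper and the can, but keep the phone after the correction.
# passed = feedback_respect_test(run_episode, ["wrapper", "soda can", "phone"], keep_item="phone")
```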

read the original abstract

Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. Videos are available at https://www.pi.website/research/hirobot

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents Hi Robot, a hierarchical vision-language-action architecture in which a high-level VLM first interprets open-ended natural-language instructions and situated visual feedback to select the next appropriate step, after which low-level action models execute the chosen primitive. The system is demonstrated qualitatively across three robot platforms (single-arm, dual-arm, and mobile dual-arm) on tasks such as table cleaning, sandwich assembly, and grocery shopping, with emphasis on its ability to incorporate corrective feedback such as “that’s not trash.”

Significance. A reliably working hierarchical decomposition could advance generalist robotics by letting robots handle nuanced, context-dependent instructions that direct VLM prompting struggles with. The multi-platform evaluation suggests some degree of transferability, yet the complete absence of quantitative metrics prevents any assessment of how large or consistent the claimed advantage actually is.

major comments (2)
  1. [Evaluation] Evaluation section: the central claim that the high-level VLM “can reason through complex prompts and incorporate situated feedback” rests entirely on qualitative video demonstrations; no success rates, error rates, confusion matrices, or controlled tests of feedback incorporation (e.g., accuracy on prompts containing “that’s not trash”) are reported, leaving the robustness assumption unmeasured.
  2. [Evaluation] Evaluation section: no baseline comparison to direct (non-hierarchical) VLM instruction following is provided, so the asserted superiority of the hierarchical structure cannot be quantified or even verified against the simpler alternative the paper contrasts with.
minor comments (2)
  1. [Implementation details] The manuscript should state the exact VLM checkpoints and prompting templates used for the high-level reasoner so that the qualitative results can be reproduced.
  2. [Figures and videos] Figure captions and video descriptions should explicitly link each clip to the specific feedback-handling behavior being illustrated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of hierarchical decomposition for handling nuanced instructions and feedback. We agree that stronger quantitative evaluation is needed and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central claim that the high-level VLM “can reason through complex prompts and incorporate situated feedback” rests entirely on qualitative video demonstrations; no success rates, error rates, confusion matrices, or controlled tests of feedback incorporation (e.g., accuracy on prompts containing “that’s not trash”) are reported, leaving the robustness assumption unmeasured.

    Authors: We agree that the current evaluation is primarily qualitative. In the revised manuscript we will add quantitative success rates obtained from repeated trials on the demonstrated tasks (table cleaning, sandwich assembly, grocery shopping) across the three platforms. We will also include a controlled test measuring the high-level VLM’s accuracy in correctly updating the plan when given corrective feedback phrases such as “that’s not trash.” revision: yes

  2. Referee: [Evaluation] Evaluation section: no baseline comparison to direct (non-hierarchical) VLM instruction following is provided, so the asserted superiority of the hierarchical structure cannot be quantified or even verified against the simpler alternative the paper contrasts with.

    Authors: We acknowledge the absence of a direct baseline comparison. We will add experiments that run the identical tasks using direct (non-hierarchical) VLM prompting and report comparative success rates, thereby quantifying the advantage of the hierarchical separation of high-level reasoning from low-level control. revision: yes
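
A minimal sketch of the comparison promised above: repeated trials of the same tasks under the hierarchical system and a direct (flat) VLM baseline, tallied as per-task success rates. The `run_trial` harness and the method and task names are assumptions for illustration.

```python
# Hypothetical tally of success rates for hierarchical vs. direct instruction following.

from collections import defaultdict

def success_rates(run_trial, methods, tasks, n_trials: int = 20):
    """run_trial(method, task, seed) -> bool; returns {method: {task: success_rate}}."""
    rates = defaultdict(dict)
    for method in methods:        # e.g. ["hi_robot_hierarchical", "direct_vlm_baseline"]
        for task in tasks:        # e.g. ["table_cleaning", "sandwich", "grocery_shopping"]
            wins = sum(run_trial(method, task, seed=i) for i in range(n_trials))
            rates[method][task] = wins / n_trials
    return rates
```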

Circularity Check

0 steps flagged

No circularity: empirical system description without derivation chain

full rationale

The paper describes a hierarchical VLM-based robotic control system for open-ended instructions and feedback, evaluated via qualitative demonstrations on three platforms for tasks such as sandwich-making and grocery shopping. No equations, fitted parameters, uniqueness theorems, or self-citations that reduce claims to inputs appear in the provided text. The central claims rest on described experiments rather than any mathematical reduction or self-referential construction, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes pre-trained VLMs possess sufficient grounded reasoning to map language and images to valid next steps without additional training or verification mechanisms described.

axioms (1)
  • domain assumption Pre-trained vision-language models can accurately deduce appropriate next physical steps from complex natural-language instructions and visual feedback.
    Invoked in the description of the high-level reasoning layer.

pith-pipeline@v0.9.0 · 5565 in / 1097 out tokens · 50966 ms · 2026-05-15T22:49:39.001717+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  2. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  3. QuadAgent: A Responsive Agent System for Vision-Language Guided Quadrotor Agile Flight

    cs.RO 2026-04 unverdicted novelty 7.0

    QuadAgent uses an asynchronous multi-agent architecture with an Impression Graph for scene memory and vision-based avoidance to enable training-free vision-language guided agile quadrotor flight, outperforming baselin...

  4. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  5. UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

    cs.RO 2026-02 unverdicted novelty 7.0

    UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

  6. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  7. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  8. G-Zero: Self-Play for Open-Ended Generation from Zero Data

    cs.LG 2026-05 unverdicted novelty 6.0

    G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.

  9. SEIF: Self-Evolving Reinforcement Learning for Instruction Following

    cs.CL 2026-05 conditional novelty 6.0

    SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

  10. AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

    cs.RO 2026-04 unverdicted novelty 6.0

    AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.

  11. ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions

    cs.RO 2026-04 unverdicted novelty 6.0

    ExpressMM integrates high-level language-guided planning with low-level vision-language-action policies to enable expressive and interruptible mobile manipulation behaviors in human-robot collaboration, shown effectiv...

  13. ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

    cs.RO 2026-03 unverdicted novelty 6.0

    ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.

  13. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  14. Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

    cs.RO 2026-02 unverdicted novelty 6.0

    Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and ge...

  15. $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    cs.LG 2025-11 unverdicted novelty 6.0

    RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

  16. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  17. Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    cs.RO 2025-10 unverdicted novelty 6.0

    A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.

  18. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  19. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  20. RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

    cs.RO 2026-04 unverdicted novelty 5.0

    RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 19 Pith papers · 13 internal anchors

  1. [1]

    Rt-h: Action hierarchies using language

    Belkhale, S., Ding, T., Xiao, T., Sermanet, P., Vuong, Q., Tompson, J., Chebotar, Y., Dwibedi, D., and Sadigh, D. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024

  2. [2]

    PaliGemma: A versatile 3B VLM for transfer

    Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. π₀: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  4. [4]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  5. [5]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023 a

  6. [6]

    Do as i can, not as i say: Grounding language in robotic affordances

    Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on robot learning, pp.\ 287--318. PMLR, 2023 b

  7. [7]

    Automating robot failure recovery using vision-language models with optimized prompts

    Chen, H., Yao, Y., Liu, R., Liu, C., and Ichnowski, J. Automating robot failure recovery using vision-language models with optimized prompts. arXiv preprint arXiv:2409.03966, 2024

  8. [8]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

  9. [9]

    Racer: Rich language-guided failure recovery policies for imitation learning

    Dai, Y., Lee, J., Fazeli, N., and Chai, J. Racer: Rich language-guided failure recovery policies for imitation learning. arXiv preprint arXiv:2409.14674, 2024

  10. [10]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  11. [11]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Fu, Z., Zhao, T. Z., and Finn, C. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

  12. [12]

    Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning, 2023

    Hu, Y., Lin, F., Zhang, T., Yi, L., and Gao, Y. Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning, 2023. URL https://arxiv.org/abs/2311.17842

  13. [13]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning, pp.\ 9118--9147. PMLR, 2022

  14. [14]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., and Fei-Fei, L. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023

  15. [15]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., and Finn, C. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pp.\ 991--1002. PMLR, 2022

  16. [16]

    Thinking, fast and slow

    Kahneman, D. Thinking, fast and slow. Farrar, Straus and Giroux, New York, 2011. ISBN 9780374275631 0374275637

  17. [17]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  18. [18]

    Interactive task planning with language models, 2025 a

    Li, B., Wu, P., Abbeel, P., and Malik, J. Interactive task planning with language models, 2025 a . URL https://arxiv.org/abs/2310.10645

  19. [19]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang, Y., et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  20. [20]

    Hamster: Hierarchical action models for open-world robot manipulation

    Li, Y., Deng, Y., Zhang, J., Jang, J., Memmel, M., Yu, R., Garrett, C. R., Ramos, F., Fox, D., Li, A., Gupta, A., and Goyal, A. Hamster: Hierarchical action models for open-world robot manipulation, 2025 b . URL https://arxiv.org/abs/2502.05485

  21. [21]

    Code as policies: Language model programs for embodied control

    Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 9493--9500. IEEE, 2023

  22. [22]

    Moka: Open-vocabulary robotic manipulation through mark-based visual prompting

    Liu, F., Fang, K., Abbeel, P., and Levine, S. Moka: Open-vocabulary robotic manipulation through mark-based visual prompting. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024 a

  23. [23]

    Interactive robot learning from verbal correction

    Liu, H., Chen, A., Zhu, Y., Swaminathan, A., Kolobov, A., and Cheng, C.-A. Interactive robot learning from verbal correction. arXiv preprint arXiv:2310.17555, 2023

  24. [24]

    Ok-robot: What really matters in integrating open-knowledge models for robotics

    Liu, P., Orru, Y., Vakil, J., Paxton, C., Shafiullah, N. M. M., and Pinto, L. Ok-robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024 b

  25. [25]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., and Zhu, J. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024 c

  26. [26]

    Decoupled weight decay regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2017

  27. [27]

    Learning to parse natural language commands to a robot control system

    Matuszek, C., Herbst, E., Zettlemoyer, L., and Fox, D. Learning to parse natural language commands to a robot control system. In Experimental Robotics: The 13th International Symposium on Experimental Robotics, volume 88, pp.\ 403. Springer, 2013

  28. [28]

    Is feedback all you need? leveraging natural language feedback in goal-conditioned rl

    McCallum, S., Taylor-Davies, M., Albrecht, S., and Suglia, A. Is feedback all you need? leveraging natural language feedback in goal-conditioned rl. In NeurIPS 2023 Workshop on Goal-Conditioned Reinforcement Learning

  29. [29]

    Learning neuro-symbolic programs for language guided robot manipulation

    Namasivayam, K., Singh, H., Bindal, V., Tuli, A., Agrawal, V., Jain, R., Singla, P., and Paul, R. Learning neuro-symbolic programs for language guided robot manipulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 7973--7980. IEEE, 2023

  30. [30]

    Pivot: Iterative visual prompting elicits actionable knowledge for vlms

    Nasiriany, S., Xia, F., Yu, W., Xiao, T., Liang, J., Dasgupta, I., Xie, A., Driess, D., Wahid, A., Xu, Z., et al. Pivot: Iterative visual prompting elicits actionable knowledge for vlms. arXiv preprint arXiv:2402.07872, 2024

  31. [31]

    Octo: An open-source generalist robot policy

    Octo Model Team , Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Xu, C., Luo, J., Kreiman, T., Tan, Y., Chen, L. Y., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., and Levine, S. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

  32. [32]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 6892--6903. IEEE, 2024

  33. [33]

    Inferring compact representations for efficient natural language understanding of robot instructions

    Patki, S., Daniele, A. F., Walter, M. R., and Howard, T. M. Inferring compact representations for efficient natural language understanding of robot instructions. In 2019 International Conference on Robotics and Automation (ICRA), pp.\ 6926--6933. IEEE, 2019

  34. [34]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  35. [35]

    Open-vocabulary mobile manipulation in unseen dynamic environments with 3d semantic maps

    Qiu, D., Ma, W., Pan, Z., Xiong, H., and Liang, J. Open-vocabulary mobile manipulation in unseen dynamic environments with 3d semantic maps. arXiv preprint arXiv:2406.18115, 2024

  36. [36]

    Robust speech recognition via large-scale weak supervision

    Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pp.\ 28492--28518. PMLR, 2023

  37. [37]

    Bumble: Unifying reasoning and acting with vision-language models for building-wide mobile manipulation

    Shah, R., Yu, A., Zhu, Y., Zhu, Y., and Martín-Martín, R. Bumble: Unifying reasoning and acting with vision-language models for building-wide mobile manipulation. arXiv preprint arXiv:2410.06237, 2024

  38. [38]

    Yell at your robot: Improving on-the-fly from language corrections

    Shi, L. X., Hu, Z., Zhao, T. Z., Sharma, A., Pertsch, K., Luo, J., Levine, S., and Finn, C. Yell at your robot: Improving on-the-fly from language corrections. arXiv preprint arXiv:2403.12910, 2024

  39. [39]

    Progprompt: Generating situated robot task plans using large language models

    Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 11523--11530. IEEE, 2023

  40. [40]

    Lgr2: Language guided reward relabeling for accelerating hierarchical reinforcement learning

    Singh, U., Bhattacharyya, P., and Namboodiri, V. P. Lgr2: Language guided reward relabeling for accelerating hierarchical reinforcement learning. arXiv preprint arXiv:2406.05881, 2024

  41. [41]

    Rlvf: Learning from verbal feedback without overgeneralization

    Stephan, M., Khazatsky, A., Mitchell, E., Chen, A. S., Hsu, S., Sharma, A., and Finn, C. Rlvf: Learning from verbal feedback without overgeneralization. arXiv preprint arXiv:2402.10893, 2024

  42. [42]

    Language-conditioned imitation learning for robot manipulation tasks

    Stepputtis, S., Campbell, J., Phielipp, M., Lee, S., Baral, C., and Ben Amor, H. Language-conditioned imitation learning for robot manipulation tasks. Advances in Neural Information Processing Systems, 33:13139--13150, 2020

  43. [43]

    Open-world object manipulation using pre-trained vision-language models

    Stone, A., Xiao, T., Lu, Y., Gopalakrishnan, K., Lee, K.-H., Vuong, Q., Wohlhart, P., Kirmani, S., Zitkovich, B., Xia, F., et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023

  44. [44]

    A computational model for the alignment of hierarchical scene representations in human-robot interaction

    Swadzba, A., Vorwerg, C., Wachsmuth, S., and Rickheit, G. A computational model for the alignment of hierarchical scene representations in human-robot interaction. In Twenty-First International Joint Conference on Artificial Intelligence. Citeseer, 2009

  45. [45]

    LLM^3: Large language model-based task and motion planning with motion failure reasoning

    Wang, S., Han, M., Jiao, Z., Zhang, Z., Wu, Y. N., Zhu, S.-C., and Liu, H. LLM^3: Large language model-based task and motion planning with motion failure reasoning. arXiv preprint arXiv:2403.11552, 2024

  46. [46]

    Tinyvla: Towards fast, data-efficient vision- language-action models for robotic manipulation

    Wen, J., Zhu, Y., Li, J., Zhu, M., Wu, K., Xu, Z., Liu, N., Cheng, R., Shen, C., Peng, Y., et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024

  47. [47]

    Robi butler: Remote multimodal interactions with household robot assistant

    Xiao, A., Janaka, N., Hu, T., Gupta, A., Li, K., Yu, C., and Hsu, D. Robi butler: Remote multimodal interactions with household robot assistant. arXiv preprint arXiv:2409.20548, 2024

  48. [48]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Zawalski, M., Chen, W., Pertsch, K., Mees, O., Finn, C., and Levine, S. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

  49. [49]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  50. [50]

    Universal actions for enhanced embodied foundation models

    Zheng, J., Li, J., Liu, D., Zheng, Y., Wang, Z., Ou, Z., Liu, Y., Liu, J., Zhang, Y.-Q., and Zhan, X. Universal actions for enhanced embodied foundation models. arXiv preprint arXiv:2501.10105, 2025

  51. [51]

    Closed-loop open-vocabulary mobile manipulation with gpt-4v

    Zhi, P., Zhang, Z., Han, M., Zhang, Z., Li, Z., Jiao, Z., Jia, B., and Huang, S. Closed-loop open-vocabulary mobile manipulation with gpt-4v. arXiv preprint arXiv:2404.10220, 2024