pith. machine review for the scientific record. sign in

arxiv: 2602.13193 · v3 · submitted 2026-02-13 · 💻 cs.RO

Recognition: no theorem link

Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

Authors on Pith no claims yet

Pith reviewed 2026-05-15 22:02 UTC · model grok-4.3

classification 💻 cs.RO
keywords steerable policiesvision-language-action modelsembodied reasoninghierarchical controlrobot manipulationVLM groundingsynthetic commandstask generalization
0
0 comments X

The pith

Steerable Policies trained on multi-level synthetic commands let VLMs steer low-level robot actions more precisely and improve generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Steerable Policies as vision-language-action models trained on synthetic commands spanning subtasks, motions, and grounded pixel coordinates. This richer interface gives high-level vision-language models finer control over low-level behavior than natural language instructions alone can provide. Experiments pair these policies with either a learned embodied reasoner or an off-the-shelf VLM using in-context learning, showing gains over prior VLAs and hierarchical baselines on real-world manipulation tasks. The central idea is that better low-level controllability unlocks the common-sense knowledge already present in pretrained VLMs. Gains appear especially on generalization and long-horizon sequences.

Core claim

Steerable Policies are VLAs trained on rich synthetic commands at multiple abstraction levels, including subtasks, motions, and grounded pixel coordinates. This training produces a low-level policy that high-level VLMs or learned reasoners can steer through these explicit command abstractions. When the resulting system is tested on real-world manipulation, both the learned-reasoner variant and the prompted-VLM variant outperform prior embodied-reasoning VLAs and VLM-based hierarchical baselines, with the largest gains on tasks that require generalization or long horizons.

What carries the argument

Steerable Policies: VLAs trained on synthetic multi-level commands (subtasks, motions, grounded pixel coordinates) that serve as a controllable interface for high-level VLMs to steer low-level robot behavior.

If this is right

  • Steerable Policies controlled by a learned high-level embodied reasoner outperform prior methods on manipulation tasks.
  • Off-the-shelf VLMs prompted to reason over command abstractions via in-context learning can also steer Steerable Policies effectively.
  • The approach yields larger gains on challenging generalization and long-horizon tasks than on simple ones.
  • Improved low-level controllability allows pretrained VLM knowledge to transfer more successfully into robot behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-level command training could be applied to other hierarchical robot systems that currently rely on natural language interfaces.
  • If synthetic command training scales, it may reduce dependence on large amounts of real-world robot data for low-level policy learning.
  • The method invites testing whether different VLM sizes or architectures benefit unequally from the added controllability.

Load-bearing premise

Training on synthetic multi-level commands transfers to real robot execution without a large domain gap, and the richer command set actually lets VLMs steer behavior in ways that improve generalization beyond natural language alone.

What would settle it

Real-world trials in which the Steerable Policy controlled by a VLM shows no improvement or worse performance than a standard VLA using only natural language commands on held-out generalization tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.13193 by Andy Tang, Catherine Glossop, Danny Driess, Jagdeep Singh Bhatia, Karl Pertsch, Nikhil Mathihalli, Ria Doshi, Sergey Levine, William Chen.

Figure 1
Figure 1. Figure 1: We propose Steerable Policies: vision-language-action models that can robustly follow a wide range of detailed commands (green boxes on right), going beyond usual task language to include instructions such as motions or pixel coordinates of the gripper and objects. The flexibility afforded by Steerable Policies enables substantially improved transfer of VLMs’ pretrained reasoning, semantic knowledge, and i… view at source ↗
Figure 2
Figure 2. Figure 2: The hierarchical policy inference loop, where a high [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Our automated pipeline for annotating robot data with [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Interactive interface for querying humans for oracle [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Allowing the oracle human user to issue any command style to our Steerable Policy nearly saturates performance on Bridge. By restricting the user to each style alone, we find each one is suited to different task types. All individual styles are better than directly providing the task-level label that is used by regular VLAs. Error bars denote ±1StdErr. mands beyond standard task-level language. Our first e… view at source ↗
Figure 7
Figure 7. Figure 7: Our approach of controlling Steerable Policies with learned high-level embodied reasoning VLMs outperforms five [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: In-context learning VLMs can effectively select abstractions for instructing our Steerable Policy. Error bars denote [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example reasonings when using an in-context learning [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: Examples of what we deem the manifold of “rea [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 10
Figure 10. Figure 10: Example steering commands. All labels and points in the image are purely for visualization, and do not appear on the actual robot training data images. The list on the bottom is exactly extracted from our training dataset, with bold representing the subtasks and the dashed list representing the corresponding diverse steering commands. Each subtask typically has more than 5 corresponding steering commands,… view at source ↗
Figure 12
Figure 12. Figure 12: Hyperparameters for training both the Steerable Policy and high￾level embodied reasoner, taken from the OpenVLA parameter logging files (as both are trained by adapting the OpenVLA codebase). A. Steerable Policy Training 1) OpenVLA-based Steerable Policies: We train our first Steerable Policy by adapting the OpenVLA codebase [10]. We use all provided default hyperparameters used for training the model on … view at source ↗
Figure 13
Figure 13. Figure 13: Example starting states for the tasks for the didactic experiment wherein a human operator acts as the high-level policy [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Example starting states for the tasks for evaluating embodied reasoning VLAs, reproduced with permission from Chen et al. [9] (as we reuse their task suite). Make the mushroom the only object on the plate Put all the food in the blue pot and stuffed toys in the tan pot Stack the pots Put the banana on the [left / right] on the plate Put the hammer on the towel Make the blue block the only object on the pl… view at source ↗
Figure 15
Figure 15. Figure 15: Example starting states for the multi-step tasks for evaluating in-context learning high-level VLMs [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Examples of embodied reasonings produced by our fine-tuned VLM, taken from rollouts for the evaluations in Sec. VI-B [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Examples of in-context reasonings produced by our off-the-shelf VLM for issuing steering commands, taken from rollouts for the evaluations in Sec. VI-C. These are the unparaphrased versions of the examples in [PITH_FULL_IMAGE:figures/full_fig_p024_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Illustrative example of why VLAs cannot leverage VLMs’ in-context learning, visual understanding, and reasoning well when issued only subtask commands (from the SayCan-like baseline in Sec. VI-C). The high-level VLM reliably detects incorrect behaviors and what the robot should do to progress the task. However, the low-level VLA fails to make use of this when only commanded with subtasks. Instead, when th… view at source ↗
Figure 19
Figure 19. Figure 19: The prompt for dividing Bridge tasks into subtasks. I am trying to label frames of a robot demonstration with various possible instructions. I will give the overall task and a list of all subtasks. Then, for each subtask, I will give a description of each frame, consisting of the timestep, gripper movement, gripper position, and a (possibly incomplete) list of object positions. All positions are pixel coo… view at source ↗
Figure 20
Figure 20. Figure 20: The prompt for generating steering commands for Bridge tasks. I want my robot to reason about its observation before choosing its behaviors. I have a dataset of demonstrations where the robot arm is solving tasks. I will provide a description of the task, the robot’s observation, text descriptions of the robot’s actions, a plan for what the robot will do, and what the robot’s current subtask likely is. I … view at source ↗
Figure 21
Figure 21. Figure 21: The prompt for rationalizing steering commands for Bridge tasks [PITH_FULL_IMAGE:figures/full_fig_p026_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Our approach’s Gemini prompt for in-context learning VLM experiments. Note that any text in square brackets (eg. [task description]) in the prompt above is replaced with the corresponding object before the prompt is passed to the VLM [PITH_FULL_IMAGE:figures/full_fig_p027_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Non-reasoning ablation’s Gemini prompt for in-context learning VLM experiments. The only change from the full prompt ( [PITH_FULL_IMAGE:figures/full_fig_p028_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: SayCan-like baseline’s Gemini prompt for in-context learning VLM experiments. The only change from the full prompt ( [PITH_FULL_IMAGE:figures/full_fig_p028_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Prompt for extracting grounded keypoints from image observations based on a description of a target object. If a VLM instruction contains multiple keypoint descriptions, this prompt is called multiple times. Note that any text in square brackets (eg. [image observation]) in the prompt above is replaced with the corresponding object before the prompt is passed to the VLM [PITH_FULL_IMAGE:figures/full_fig_… view at source ↗
read the original abstract

Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Steerable Policies as VLAs trained on rich synthetic commands at multiple abstraction levels (subtasks, motions, grounded pixel coordinates) to enhance low-level controllability. This richer interface is claimed to unlock pretrained VLM knowledge for better embodied reasoning and hierarchical control. The work evaluates two control methods—a learned high-level embodied reasoner and an off-the-shelf VLM using in-context learning over command abstractions—and reports that both outperform prior VLAs and VLM-based hierarchical baselines on real-world manipulation tasks, including generalization and long-horizon scenarios.

Significance. If the empirical claims hold with proper validation, the result would be significant for embodied AI by demonstrating a practical way to bridge high-level VLM reasoning with low-level robot control via synthetic multi-level commands, potentially improving generalization without heavy real-world fine-tuning. The use of both learned and prompted VLM controllers is a notable strength, as is the focus on real hardware evaluation.

major comments (2)
  1. [Abstract] Abstract: The central claim that Steerable Policies unlock VLM knowledge and yield outperformance on real-world experiments is stated without any quantitative metrics, success rates, error bars, data splits, baseline details, or statistical significance. This is load-bearing because the reported gains could arise solely from low-level policy improvements rather than the richer command interface enabling better VLM steering.
  2. [Experiments] Experiments section: No evidence is provided for command-following success rates on held-out real trajectories or ablations isolating the effect of command richness (e.g., multi-level vs. natural language only). This directly undermines validation of the weakest assumption that synthetic training transfers without domain gap and that the additional channels transmit useful VLM reasoning.
minor comments (1)
  1. [Abstract] Abstract: The provided website link is useful but the summary text does not reference specific figures, tables, or sections containing the quantitative results needed to support the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the presentation of quantitative evidence and experimental validation. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that Steerable Policies unlock VLM knowledge and yield outperformance on real-world experiments is stated without any quantitative metrics, success rates, error bars, data splits, baseline details, or statistical significance. This is load-bearing because the reported gains could arise solely from low-level policy improvements rather than the richer command interface enabling better VLM steering.

    Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we will incorporate key success rates from the real-world experiments (with error bars), baseline comparisons, and a brief note on data splits and statistical testing. We will also explicitly reference the ablations in the experiments section that isolate the contribution of the multi-level command interface, thereby clarifying that gains are not attributable solely to low-level policy improvements. revision: yes

  2. Referee: [Experiments] Experiments section: No evidence is provided for command-following success rates on held-out real trajectories or ablations isolating the effect of command richness (e.g., multi-level vs. natural language only). This directly undermines validation of the weakest assumption that synthetic training transfers without domain gap and that the additional channels transmit useful VLM reasoning.

    Authors: We will add a new subsection in the experiments that reports command-following success rates on held-out real-world trajectories. We will also include explicit ablations that compare multi-level synthetic commands against natural-language-only interfaces, directly measuring the incremental benefit of command richness. These additions will provide the requested evidence for synthetic-to-real transfer and the utility of the additional steering channels. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external real-world validation

full rationale

The paper proposes training VLAs on synthetic multi-level commands (subtasks, motions, pixel coordinates) and evaluates the resulting steerable policies via real-robot experiments with both learned reasoners and prompted VLMs. No mathematical derivation chain exists; claims rest on empirical outperformance rather than any equation or parameter that reduces to its own inputs by construction. No self-citation is invoked as a uniqueness theorem or load-bearing premise. The central benefit (improved VLM steering via richer interfaces) is tested against baselines on held-out real tasks, satisfying the requirement for independent external evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical robotics paper; no mathematical derivations, fitted constants, or new theoretical entities are introduced in the abstract. The central claim rests on the assumption that richer command interfaces improve transfer and generalization, which is tested experimentally rather than derived.

pith-pipeline@v0.9.0 · 5545 in / 1212 out tokens · 22137 ms · 2026-05-15T22:02:59.533818+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · cited by 1 Pith paper · 13 internal anchors

  1. [1]

    Do as i can, not as i say: Grounding language in robotic affordances, 2022

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Haus- man, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang,...

  2. [2]

    Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, and Chelsea Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models, 2025. URL https://arxiv.org/abs/2502.19417

  3. [3]

    Limited Linguistic Diversity in Embodied AI Datasets

    Selma Wanna, Agnes Luhtaru, Jonathan Salfity, Ryan Barron, Juston Moore, Cynthia Matuszek, and Mitch Pryor. Limited linguistic diversity in embodied ai datasets, 2026. URL https://arxiv.org/abs/2601.03136

  4. [4]

    Bridge data: Boosting generalization of robotic skills with cross- domain datasets, 2021

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Dani- ilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross- domain datasets, 2021

  5. [5]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen- Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning (CoRL), 2023

  6. [6]

    Chain-of-thought prompting elicits reason- ing in large language models, 2023

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reason- ing in large language models, 2023

  7. [7]

    Large language models are zero-shot reasoners, 2023

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- taka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023

  8. [8]

    Robotic control via embodied chain-of-thought reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. InConference on Robot Learning, 2024

  9. [9]

    Training strategies for efficient embodied reasoning, 2025

    William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning, 2025

  10. [10]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. 2024

  11. [11]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pert...

  12. [12]

    Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stef...

  13. [13]

    Rt-2: Vision-language-action models transfer web knowl- edge to robotic control, 2023

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Flo- rence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexan- der Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashn...

  14. [14]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tok- enization for vision-language-action models, 2025. URL https://arxiv.org/abs/2501.09747

  15. [15]

    Minivla: A better vla with a smaller footprint, 2024

    Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://ai.stanford. edu/blog/minivla/

  16. [16]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π 0: A vi...

  17. [17]

    Fine- tuning vision-language-action models: Optimizing speed and success, 2025

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine- tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502. 19645

  18. [18]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, Steven Bo- hez, Konstantinos Bousmalis, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Oscar Chang, Jose En- ri...

  19. [19]

    Cast: Counterfactual labels improve instruction following in vision-language-action models, 2025

    Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, and Sergey Levine. Cast: Counterfactual labels improve instruction following in vision-language-action models, 2025. URL https://arxiv.org/abs/2508.13446

  20. [20]

    Robotic skill acquisition via instruc- tion augmentation with vision-language models, 2022

    Ted Xiao, Harris Chan, Pierre Sermanet, Ayzaan Wahid, Anthony Brohan, Karol Hausman, Sergey Levine, and Jonathan Tompson. Robotic skill acquisition via instruc- tion augmentation with vision-language models, 2022

  21. [21]

    Steer: Flexible robotic manipulation via dense language grounding, 2024

    Laura Smith, Alex Irpan, Montserrat Gonzalez Arenas, Sean Kirmani, Dmitry Kalashnikov, Dhruv Shah, and Ted Xiao. Steer: Flexible robotic manipulation via dense language grounding, 2024. URL https://arxiv.org/abs/ 2411.03409

  22. [22]

    Jesse Zhang, Karl Pertsch, Jiahui Zhang, and Joseph J. Lim. Sprint: Scalable policy pre-training via language instruction relabeling, 2024. URL https://arxiv.org/abs/ 2306.11886

  23. [23]

    Interactive language: Talking to robots in real time, 2022

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time, 2022

  24. [24]

    R., Ramos, F., Fox, D., Li, A., Gupta, A., and Goyal, A

    Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, and Ankit Goyal. Hamster: Hierarchical action models for open- world robot manipulation, 2025. URL https://arxiv.org/ abs/2502.05485

  25. [25]

    Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,

    Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, and Ted Xiao. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,

  26. [26]

    URL https://arxiv.org/abs/2311.01977

  27. [27]

    TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum ´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting en- hances spatial-temporal awareness for generalist robotic policies, 2025. URL https://arxiv.org/abs/2412.10345

  28. [28]

    Omnivla: An omni-modal vision- language-action model for robot navigation, 2025

    Noriaki Hirose, Catherine Glossop, Dhruv Shah, and Sergey Levine. Omnivla: An omni-modal vision- language-action model for robot navigation, 2025. URL https://arxiv.org/abs/2509.19480

  29. [29]

    MolmoAct: Action Reasoning Models that can Reason in Space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, An- gelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URL https://arxiv.org/abs/2508.07917

  30. [30]

    Im- proving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jian- feng Wang, Linjie Li, LongOuyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhari- wal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Im- proving image generation with better captions. 2023

  31. [31]

    Integrated task and motion plan- ning, 2020

    Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tom´as Lozano-P ´erez. Integrated task and motion plan- ning, 2020

  32. [32]

    Farrar, Straus and Giroux, 2011

    Daniel Kahneman.Thinking, fast and slow. Farrar, Straus and Giroux, 2011

  33. [33]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, and Mingxing Tan. Emma: End- to-end multimodal model for autonomous driving, 2024. URL https://arxiv.org/abs/2410.23262

  34. [34]

    Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn

    Lucy Xiaoyang Shi, Zheyuan Hu, Tony Z. Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn. Yell at your robot: Improving on-the-fly from language corrections, 2024

  35. [35]

    Interactive task planning with language models, 2025 a

    Boyi Li, Philipp Wu, Pieter Abbeel, and Jitendra Malik. Interactive task planning with language models, 2025. URL https://arxiv.org/abs/2310.10645

  36. [36]

    Lm-nav: Robotic navigation with large pre- trained models of language, vision, and action, 2022

    Dhruv Shah, Blazej Osinski, Brian Ichter, and Sergey Levine. Lm-nav: Robotic navigation with large pre- trained models of language, vision, and action, 2022. URL https://arxiv.org/abs/2207.04429

  37. [37]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

  38. [38]

    URL https://arxiv.org/abs/2201.07207

  39. [39]

    Inner monologue: Embodied reasoning through planning with language models, 2022

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022

  40. [40]

    Socratic models: Composing zero-shot multimodal reasoning with language, 2022

    Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sind- hwani, Johnny Lee, Vincent Vanhoucke, and Pete Flo- rence. Socratic models: Composing zero-shot multimodal reasoning with language, 2022

  41. [41]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

  42. [42]

    Jesse Zhang, Jiahui Zhang, Karl Pertsch, Ziyi Liu, Xiang Ren, Minsuk Chang, Shao-Hua Sun, and Joseph J. Lim. Bootstrap your own skills: Learning to solve new tasks with large language model guidance, 2023. URL https: //arxiv.org/abs/2310.10021

  43. [43]

    Scaling up and distilling down: Language-guided robot skill acquisition, 2023

    Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition, 2023

  44. [44]

    Rt-h: Action hierar- chies using language, 2024

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, De- bidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierar- chies using language, 2024

  45. [45]

    Lohovla: A unified vision-language-action model for long-horizon embodied tasks, 2025

    Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, and Zhijie Deng. Lohovla: A unified vision-language-action model for long-horizon embodied tasks, 2025. URL https://arxiv.org/abs/2506.00411

  46. [46]

    From code to action: Hierarchical learning of diffusion- vlm policies, 2025

    Markus Peschl, Pietro Mazzaglia, and Daniel Dijkman. From code to action: Hierarchical learning of diffusion- vlm policies, 2025. URL https://arxiv.org/abs/2509. 24917

  47. [47]

    Robix: A unified model for robot interaction, reasoning and planning, 2025

    Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, and Hang Li. Robix: A unified model for robot interaction, reasoning and planning, 2025. URL https://arxiv.org/abs/ 2509.01106

  48. [48]

    Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576, 2025

    Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model, 2025. URL https://arxiv.org/abs/2509.00576

  49. [49]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tri- pathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yv...

  50. [50]

    Sam 2: Segment anything in images and videos, 2024

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Rong- hang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R ¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Doll ´ar, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxi...

  51. [51]

    End-to-end object detection with transform- ers, 2020

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transform- ers, 2020. URL https://arxiv.org/abs/2005.12872

  52. [52]

    Gemini: A family of highly capable multimodal models, 2024

    Gemini Team. Gemini: A family of highly capable multimodal models, 2024

  53. [53]

    Tensorrt- openvla, 2025

    William Chen, Michał Zawalski, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Tensorrt- openvla, 2025. URL https://github.com/rail-berkeley/ tensorrt-openvla

  54. [54]

    Tensorrt-llm

    NVIDIA. Tensorrt-llm. https://github.com/NVIDIA/ TensorRT-LLM?tab=readme-ov-file, 2024

  55. [55]

    In-context imitation learning via next-token prediction.arXiv preprint arXiv:2408.15980, 2024

    Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yun- liang Chen, William Chung-Ho Panitch, Fangchen Liu, Hui Li, and Ken Goldberg. In-context imitation learning via next-token prediction.arXiv preprint arXiv:2408.15980, 2024

  56. [56]

    In-context learning enables robot action prediction in llms, 2025

    Yida Yin, Zekai Wang, Yuvan Sharma, Dantong Niu, Trevor Darrell, and Roei Herzig. In-context learning enables robot action prediction in llms, 2025. URL https://arxiv.org/abs/2410.12782

  57. [57]

    Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Anni...

  58. [58]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and...

  59. [59]

    Pris- matic vlms: Investigating the design space of visually- conditioned language models, 2024

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Pris- matic vlms: Investigating the design space of visually- conditioned language models, 2024

  60. [60]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, Andr ´e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschan- nen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Gryc- ner, Alexey Gritsenko, Neil Houlsby, Manoj Ku- mar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matt...

  61. [61]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis, 2021. URL https://arxiv. org/abs/2105.05233

  62. [62]

    Classifier-free diffusion guidance, 2022

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022

  63. [63]

    Inference-time policy steering through human interactions, 2024

    Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sun- daralingam, Xuning Yang, Yu-Wei Chao, Claudia Perez- D’Arpino, Dieter Fox, and Julie Shah. Inference-time policy steering through human interactions, 2024

  64. [64]

    Steering your generalists: Improving robotic foundation models via value guidance.Confer- ence on Robot Learning (CoRL), 2024

    Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance.Confer- ence on Robot Learning (CoRL), 2024

  65. [65]

    Steering your diffusion policy with latent space reinforcement learning,

    Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Naga- bandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning,

  66. [66]

    URL https://arxiv.org/abs/2506.15799

  67. [67]

    Diffusion guidance is a controllable policy im- provement operator.arXiv preprint arXiv:2505.23458,

    Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator, 2025. URL https://arxiv.org/abs/ 2505.23458

  68. [68]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Opti- mizing continuous prompts for generation, 2021. URL https://arxiv.org/abs/2101.00190

  69. [69]

    Robomonkey: Scaling test-time sampling and verification for vision-language-action models, 2025

    Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models, 2025. URL https://arxiv.org/abs/2506.17811

  70. [70]

    Dynaguide: Steering diffusion policies with active dynamic guidance

    Maximilian Du and Shuran Song. Dynaguide: Steering diffusion policies with active dynamic guidance. InPro- ceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS), 2025

  71. [71]

    From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment, 2025

    Yilin Wu, Ran Tian, Gokul Swamy, and Andrea Bajcsy. From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment, 2025. URL https://arxiv. org/abs/2502.01828

  72. [72]

    Goodman and Michael C

    Noah D. Goodman and Michael C. Frank. Pragmatic language interpretation as probabilistic inference.Trends in Cognitive Sciences, 20(11):818–829, 2016. ISSN 1364-6613. doi: https://doi.org/10.1016/j.tics.2016.08

  73. [73]

    URL https://www.sciencedirect.com/science/article/ pii/S136466131630122X

  74. [74]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310

  75. [75]

    Libero-pro: Towards robust and fair eval- uation of vision-language-action models beyond memo- rization, 2025

    Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. Libero-pro: Towards robust and fair eval- uation of vision-language-action models beyond memo- rization, 2025. URL https://arxiv.org/abs/2510.03827

  76. [76]

    Llama 2: Open foundation and fine-tuned chat models, 2023

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernan- des, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Har...

  77. [77]

    Di- nov2: Learning robust visual features without supervi- sion, 2024

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fer- nandez, Daniel Haziza, Francisco Massa, Alaaeldin El- Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Jegou, Julien Mairal, P...

  78. [78]

    Sigmoid loss for language image pre- training, 2023

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre- training, 2023

  79. [79]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abra- ham Le...

  80. [80]

    Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine

    Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine. Knowledge insulating vision-language- action models: Train fast, run fast, generalize better,

Showing first 80 references.