arxiv: 2602.13193 · v3 · submitted 2026-02-13 · 💻 cs.RO

Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

William Chen, Jagdeep Singh Bhatia, Catherine Glossop, Nikhil Mathihalli, Ria Doshi, Andy Tang, Danny Driess, Karl Pertsch, Sergey Levine

Pith reviewed 2026-05-15 22:02 UTC · model grok-4.3

classification 💻 cs.RO

keywords steerable policies · vision-language-action models · embodied reasoning · hierarchical control · robot manipulation · VLM grounding · synthetic commands · task generalization

The pith

Steerable Policies trained on multi-level synthetic commands let VLMs steer low-level robot actions more precisely and improve generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Steerable Policies as vision-language-action models trained on synthetic commands spanning subtasks, motions, and grounded pixel coordinates. This richer interface gives high-level vision-language models finer control over low-level behavior than natural language instructions alone can provide. Experiments pair these policies with either a learned embodied reasoner or an off-the-shelf VLM using in-context learning, showing gains over prior VLAs and hierarchical baselines on real-world manipulation tasks. The central idea is that better low-level controllability unlocks the common-sense knowledge already present in pretrained VLMs. Gains appear especially on generalization and long-horizon sequences.

Core claim

Steerable Policies are VLAs trained on rich synthetic commands at multiple abstraction levels, including subtasks, motions, and grounded pixel coordinates. This training produces a low-level policy that high-level VLMs or learned reasoners can steer through these explicit command abstractions. When the resulting system is tested on real-world manipulation, both the learned-reasoner variant and the prompted-VLM variant outperform prior embodied-reasoning VLAs and VLM-based hierarchical baselines, with the largest gains on tasks that require generalization or long horizons.
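To make the command abstractions concrete, here is a minimal Python sketch of how steering commands at the levels named above might be represented and rendered into the text a policy conditions on. The class names, field names, and phrasing are illustrative assumptions, not the paper's data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class CommandStyle(Enum):
    """Abstraction levels named in the paper; the enum itself is hypothetical."""
    TASK = "task"
    SUBTASK = "subtask"
    MOTION = "motion"
    POINT = "point"


@dataclass(frozen=True)
class SteeringCommand:
    """One command issued to the low-level policy (hypothetical container)."""
    style: CommandStyle
    text: str
    pixel: Optional[Tuple[int, int]] = None  # only used by POINT commands

    def render(self) -> str:
        """Return the string the policy would receive as its instruction."""
        if self.style is CommandStyle.POINT:
            if self.pixel is None:
                raise ValueError("point commands need a pixel coordinate")
            x, y = self.pixel
            return f"{self.text} at ({x}, {y})"
        return self.text


# Made-up example commands, one per non-task style.
commands = [
    SteeringCommand(CommandStyle.SUBTASK, "grasp the mushroom"),
    SteeringCommand(CommandStyle.MOTION, "move the gripper left"),
    SteeringCommand(CommandStyle.POINT, "move the gripper to the point", pixel=(112, 87)),
]
rendered = [command.render() for command in commands]
```

Keeping the style as an explicit field makes it easy to restrict a controller to one style at a time, which is the kind of comparison the paper's Figure 6 caption describes.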

What carries the argument

Steerable Policies: VLAs trained on synthetic multi-level commands (subtasks, motions, grounded pixel coordinates) that serve as a controllable interface for high-level VLMs to steer low-level robot behavior.
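The steering relationship described here can be sketched as a simple two-level loop: a high-level model periodically picks a command, and the low-level policy turns the current observation and that command into actions. The function names, the fixed replanning interval, and the toy one-dimensional environment below are all assumptions made for illustration; the paper's actual inference loop may differ.

```python
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, float]
Action = List[float]


def run_hierarchy(
    high_level: Callable[[Observation, str], str],
    low_level: Callable[[Observation, str], Action],
    step_env: Callable[[Observation, Action], Tuple[Observation, bool]],
    obs: Observation,
    task: str,
    max_steps: int = 20,
    replan_every: int = 5,  # assumed fixed interval between high-level queries
) -> List[str]:
    """Run the loop and return the list of commands the high level issued."""
    command = ""
    issued: List[str] = []
    for t in range(max_steps):
        if t % replan_every == 0:
            command = high_level(obs, task)  # e.g. a VLM or learned reasoner
            issued.append(command)
        action = low_level(obs, command)  # e.g. a steerable VLA
        obs, done = step_env(obs, action)
        if done:
            break
    return issued


# Toy stand-ins so the sketch runs without a robot or a model.
def toy_high_level(obs: Observation, task: str) -> str:
    return "move right" if obs["x"] < obs["goal"] else "hold still"


def toy_low_level(obs: Observation, command: str) -> Action:
    return [1.0] if command == "move right" else [0.0]


def toy_step(obs: Observation, action: Action) -> Tuple[Observation, bool]:
    new_obs = {"x": obs["x"] + action[0], "goal": obs["goal"]}
    return new_obs, new_obs["x"] >= new_obs["goal"]


history = run_hierarchy(
    toy_high_level, toy_low_level, toy_step,
    {"x": 0.0, "goal": 3.0}, "reach the goal", replan_every=2,
)
```

In this toy run the high level is queried at steps 0 and 2 and the episode ends once the goal position is reached.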

If this is right

Steerable Policies controlled by a learned high-level embodied reasoner outperform prior methods on manipulation tasks.
Off-the-shelf VLMs prompted to reason over command abstractions via in-context learning can also steer Steerable Policies effectively.
The approach yields larger gains on challenging generalization and long-horizon tasks than on simple ones.
Improved low-level controllability allows pretrained VLM knowledge to transfer more successfully into robot behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-level command training could be applied to other hierarchical robot systems that currently rely on natural language interfaces.
If synthetic command training scales, it may reduce dependence on large amounts of real-world robot data for low-level policy learning.
The method invites testing whether different VLM sizes or architectures benefit unequally from the added controllability.

Load-bearing premise

Training on synthetic multi-level commands transfers to real robot execution without a large domain gap, and the richer command set actually lets VLMs steer behavior in ways that improve generalization beyond natural language alone.

What would settle it

Real-world trials in which the Steerable Policy controlled by a VLM shows no improvement or worse performance than a standard VLA using only natural language commands on held-out generalization tasks would falsify the central claim.
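A comparison of this kind comes down to two success rates and their uncertainty. The sketch below computes a binomial standard error for each method and for the gap between them; the figure captions report error bars of one standard error, but the estimator the authors used is not stated, so the formula here is an assumption. The outcome lists are placeholder data, not results from the paper.

```python
import math
from typing import Sequence, Tuple


def success_rate(outcomes: Sequence[int]) -> Tuple[float, float]:
    """Return (success rate, binomial standard error) for 0/1 trial outcomes."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one trial")
    p = sum(outcomes) / n
    return p, math.sqrt(p * (1.0 - p) / n)


def rate_gap(method: Sequence[int], baseline: Sequence[int]) -> Tuple[float, float]:
    """Return (difference in success rate, standard error of the difference)."""
    p_m, se_m = success_rate(method)
    p_b, se_b = success_rate(baseline)
    # Independent trials assumed, so the variances add.
    return p_m - p_b, math.sqrt(se_m ** 2 + se_b ** 2)


# Placeholder outcomes (1 = success), invented purely to exercise the code.
steerable_trials = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
baseline_trials = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
gap, gap_se = rate_gap(steerable_trials, baseline_trials)
```

A gap that is small relative to its standard error on held-out tasks would be the kind of outcome the falsification test above has in mind.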

Figures

Figures reproduced from arXiv: 2602.13193 by Andy Tang, Catherine Glossop, Danny Driess, Jagdeep Singh Bhatia, Karl Pertsch, Nikhil Mathihalli, Ria Doshi, Sergey Levine, William Chen.

**Figure 1.** We propose Steerable Policies: vision-language-action models that can robustly follow a wide range of detailed commands (green boxes on right), going beyond usual task language to include instructions such as motions or pixel coordinates of the gripper and objects. The flexibility afforded by Steerable Policies enables substantially improved transfer of VLMs’ pretrained reasoning, semantic knowledge, and i…

**Figure 2.** The hierarchical policy inference loop, where a high…

**Figure 3.** Our automated pipeline for annotating robot data with…

**Figure 5.** Interactive interface for querying humans for oracle…

**Figure 6.** Allowing the oracle human user to issue any command style to our Steerable Policy nearly saturates performance on Bridge. By restricting the user to each style alone, we find each one is suited to different task types. All individual styles are better than directly providing the task-level label that is used by regular VLAs. Error bars denote ±1 StdErr.

**Figure 7.** Our approach of controlling Steerable Policies with learned high-level embodied reasoning VLMs outperforms five…

**Figure 8.** In-context learning VLMs can effectively select abstractions for instructing our Steerable Policy. Error bars denote…

**Figure 9.** Example reasonings when using an in-context learning…

**Figure 10.** Example steering commands. All labels and points in the image are purely for visualization, and do not appear on the actual robot training data images. The list on the bottom is exactly extracted from our training dataset, with bold representing the subtasks and the dashed list representing the corresponding diverse steering commands. Each subtask typically has more than 5 corresponding steering commands,…

**Figure 11.** Examples of what we deem the manifold of “rea…

**Figure 12.** Hyperparameters for training both the Steerable Policy and high-level embodied reasoner, taken from the OpenVLA parameter logging files (as both are trained by adapting the OpenVLA codebase). Accompanying text: A. Steerable Policy Training. 1) OpenVLA-based Steerable Policies: We train our first Steerable Policy by adapting the OpenVLA codebase [10]. We use all provided default hyperparameters used for training the model on …

**Figure 13.** Example starting states for the tasks for the didactic experiment wherein a human operator acts as the high-level policy

**Figure 14.** Example starting states for the tasks for evaluating embodied reasoning VLAs, reproduced with permission from Chen et al. [9] (as we reuse their task suite). Task labels: Make the mushroom the only object on the plate; Put all the food in the blue pot and stuffed toys in the tan pot; Stack the pots; Put the banana on the [left / right] on the plate; Put the hammer on the towel; Make the blue block the only object on the pl…

**Figure 15.** Example starting states for the multi-step tasks for evaluating in-context learning high-level VLMs

**Figure 16.** Examples of embodied reasonings produced by our fine-tuned VLM, taken from rollouts for the evaluations in Sec. VI-B

**Figure 17.** Examples of in-context reasonings produced by our off-the-shelf VLM for issuing steering commands, taken from rollouts for the evaluations in Sec. VI-C. These are the unparaphrased versions of the examples in…

**Figure 18.** Illustrative example of why VLAs cannot leverage VLMs’ in-context learning, visual understanding, and reasoning well when issued only subtask commands (from the SayCan-like baseline in Sec. VI-C). The high-level VLM reliably detects incorrect behaviors and what the robot should do to progress the task. However, the low-level VLA fails to make use of this when only commanded with subtasks. Instead, when th…

**Figure 19.** The prompt for dividing Bridge tasks into subtasks. Prompt text: I am trying to label frames of a robot demonstration with various possible instructions. I will give the overall task and a list of all subtasks. Then, for each subtask, I will give a description of each frame, consisting of the timestep, gripper movement, gripper position, and a (possibly incomplete) list of object positions. All positions are pixel coo…

**Figure 20.** The prompt for generating steering commands for Bridge tasks. Prompt text: I want my robot to reason about its observation before choosing its behaviors. I have a dataset of demonstrations where the robot arm is solving tasks. I will provide a description of the task, the robot’s observation, text descriptions of the robot’s actions, a plan for what the robot will do, and what the robot’s current subtask likely is. I …

**Figure 21.** The prompt for rationalizing steering commands for Bridge tasks

**Figure 22.** Our approach’s Gemini prompt for in-context learning VLM experiments. Note that any text in square brackets (e.g., [task description]) in the prompt above is replaced with the corresponding object before the prompt is passed to the VLM

**Figure 23.** Non-reasoning ablation’s Gemini prompt for in-context learning VLM experiments. The only change from the full prompt (…

**Figure 24.** SayCan-like baseline’s Gemini prompt for in-context learning VLM experiments. The only change from the full prompt (…

**Figure 25.** Prompt for extracting grounded keypoints from image observations based on a description of a target object. If a VLM instruction contains multiple keypoint descriptions, this prompt is called multiple times. Note that any text in square brackets (e.g., [image observation]) in the prompt above is replaced with the corresponding object before the prompt is passed to the VLM
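The captions for Figures 22 and 25 describe two mechanical steps: bracketed placeholders in a prompt are replaced before the prompt reaches the VLM, and the keypoint prompt is issued once per keypoint description. A minimal Python sketch of both follows. The template wording, the `[object description]` placeholder name, and the `query_vlm` callable are hypothetical stand-ins; the actual prompts are in the paper's figures.

```python
import re
from typing import Callable, Dict, List, Tuple


def fill_prompt(template: str, values: Dict[str, str]) -> str:
    """Replace every [placeholder] in the template with its value."""
    def substitute(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in values:
            # Fail loudly rather than send an unfilled placeholder to the VLM.
            raise KeyError(f"no value for placeholder [{key}]")
        return values[key]

    return re.sub(r"\[([^\[\]]+)\]", substitute, template)


# Hypothetical one-line template standing in for the real keypoint prompt.
KEYPOINT_TEMPLATE = "Return the pixel coordinate of: [object description]"


def ground_keypoints(
    descriptions: List[str],
    query_vlm: Callable[[str], Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Issue one keypoint query per description, in order."""
    return [
        query_vlm(fill_prompt(KEYPOINT_TEMPLATE, {"object description": text}))
        for text in descriptions
    ]


# Fake VLM that records prompts and returns made-up coordinates.
calls: List[str] = []


def fake_vlm(prompt: str) -> Tuple[int, int]:
    calls.append(prompt)
    return (10 * len(calls), 20)


points = ground_keypoints(["the blue pot", "the banana"], fake_vlm)
```

Raising on an unknown placeholder is a design choice in this sketch; a real pipeline might instead leave literal bracketed text untouched.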

Original abstract

Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Training VLAs on synthetic multi-level commands (subtasks, motions, pixel coords) makes them more steerable by VLMs and yields real-robot gains on manipulation and long-horizon tasks.

The main takeaway is that this work trains vision-language-action models on synthetic commands at several abstraction levels instead of plain natural language. The richer interface is meant to let VLMs steer low-level behavior more precisely, which they show helps generalization on real hardware. They test the idea with both a learned high-level reasoner and an off-the-shelf VLM prompted via in-context learning, and report that both beat prior embodied VLAs and hierarchical baselines on manipulation experiments, including generalization and long-horizon cases.

The real-robot evaluation is the clearest strength; hardware results always carry more weight than simulation-only claims. The training regime itself is a straightforward but useful extension of existing VLA work.

On the softer side, the abstract gives no numbers, no error bars, no ablation on which command levels drive the gains, and no details on data splits or statistical tests. That makes it hard to judge how large the improvement actually is, or whether the extra command channels are doing the steering work or whether the low-level policy is simply better overall. The synthetic-to-real transfer concern is reasonable to flag until the full paper shows command-following success rates on held-out real trajectories. If those checks are there and hold, the central claim looks solid; if not, the outperformance could be explained by other factors.

This is for people working on hierarchical robot control and VLM-VLA integration. A reader who cares about practical embodied systems would get concrete ideas from the command design and the hardware results. It deserves peer review because the idea is clear, the evaluation is on real robots, and the gaps are fixable with more quantitative presentation rather than fundamental problems.

Referee Report

2 major / 1 minor

Summary. The paper introduces Steerable Policies as VLAs trained on rich synthetic commands at multiple abstraction levels (subtasks, motions, grounded pixel coordinates) to enhance low-level controllability. This richer interface is claimed to unlock pretrained VLM knowledge for better embodied reasoning and hierarchical control. The work evaluates two control methods—a learned high-level embodied reasoner and an off-the-shelf VLM using in-context learning over command abstractions—and reports that both outperform prior VLAs and VLM-based hierarchical baselines on real-world manipulation tasks, including generalization and long-horizon scenarios.

Significance. If the empirical claims hold with proper validation, the result would be significant for embodied AI by demonstrating a practical way to bridge high-level VLM reasoning with low-level robot control via synthetic multi-level commands, potentially improving generalization without heavy real-world fine-tuning. The use of both learned and prompted VLM controllers is a notable strength, as is the focus on real hardware evaluation.

major comments (2)

[Abstract] The central claim that Steerable Policies unlock VLM knowledge and yield outperformance on real-world experiments is stated without any quantitative metrics, success rates, error bars, data splits, baseline details, or statistical significance. This is load-bearing because the reported gains could arise solely from low-level policy improvements rather than the richer command interface enabling better VLM steering.

[Experiments] No evidence is provided for command-following success rates on held-out real trajectories or ablations isolating the effect of command richness (e.g., multi-level vs. natural language only). This directly undermines validation of the weakest assumption that synthetic training transfers without domain gap and that the additional channels transmit useful VLM reasoning.

minor comments (1)

[Abstract] The provided website link is useful, but the summary text does not reference specific figures, tables, or sections containing the quantitative results needed to support the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the presentation of quantitative evidence and experimental validation. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central claim that Steerable Policies unlock VLM knowledge and yield outperformance on real-world experiments is stated without any quantitative metrics, success rates, error bars, data splits, baseline details, or statistical significance. This is load-bearing because the reported gains could arise solely from low-level policy improvements rather than the richer command interface enabling better VLM steering.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we will incorporate key success rates from the real-world experiments (with error bars), baseline comparisons, and a brief note on data splits and statistical testing. We will also explicitly reference the ablations in the experiments section that isolate the contribution of the multi-level command interface, thereby clarifying that gains are not attributable solely to low-level policy improvements.

revision: yes

Referee: [Experiments] No evidence is provided for command-following success rates on held-out real trajectories or ablations isolating the effect of command richness (e.g., multi-level vs. natural language only). This directly undermines validation of the weakest assumption that synthetic training transfers without domain gap and that the additional channels transmit useful VLM reasoning.

Authors: We will add a new subsection in the experiments that reports command-following success rates on held-out real-world trajectories. We will also include explicit ablations that compare multi-level synthetic commands against natural-language-only interfaces, directly measuring the incremental benefit of command richness. These additions will provide the requested evidence for synthetic-to-real transfer and the utility of the additional steering channels.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external real-world validation

The paper proposes training VLAs on synthetic multi-level commands (subtasks, motions, pixel coordinates) and evaluates the resulting steerable policies via real-robot experiments with both learned reasoners and prompted VLMs. No mathematical derivation chain exists; claims rest on empirical outperformance rather than any equation or parameter that reduces to its own inputs by construction. No self-citation is invoked as a uniqueness theorem or load-bearing premise. The central benefit (improved VLM steering via richer interfaces) is tested against baselines on held-out real tasks, satisfying the requirement for independent external evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical robotics paper; no mathematical derivations, fitted constants, or new theoretical entities are introduced in the abstract. The central claim rests on the assumption that richer command interfaces improve transfer and generalization, which is tested experimentally rather than derived.

pith-pipeline@v0.9.0 · 5545 in / 1212 out tokens · 22137 ms · 2026-05-15T22:02:59.533818+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · cited by 1 Pith paper · 13 internal anchors

[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang,... Do as I can, not as I say: Grounding language in robotic affordances, 2022

[2] Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, and Chelsea Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models, 2025. URL https://arxiv.org/abs/2502.19417

[3] Selma Wanna, Agnes Luhtaru, Jonathan Salfity, Ryan Barron, Juston Moore, Cynthia Matuszek, and Mitch Pryor. Limited linguistic diversity in embodied AI datasets, 2026. URL https://arxiv.org/abs/2601.03136

[4] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets, 2021

[5] Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023

[6] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

[7] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023
[8] Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Conference on Robot Learning, 2024

[9] William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning, 2025

[10] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. 2024

[11] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pert... 2025

[12] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stef... 2022

[13] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashn... Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023

[14] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025. URL https://arxiv.org/abs/2501.09747
[15] Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://ai.stanford.edu/blog/minivla/

[16] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control, 2024

[17] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645

[18] Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, Steven Bohez, Konstantinos Bousmalis, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Oscar Chang, Jose Enri... Gemini Robotics: Bringing AI into the Physical World, 2025

[19] Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, and Sergey Levine. Cast: Counterfactual labels improve instruction following in vision-language-action models, 2025. URL https://arxiv.org/abs/2508.13446

[20] Ted Xiao, Harris Chan, Pierre Sermanet, Ayzaan Wahid, Anthony Brohan, Karol Hausman, Sergey Levine, and Jonathan Tompson. Robotic skill acquisition via instruction augmentation with vision-language models, 2022

[21] Laura Smith, Alex Irpan, Montserrat Gonzalez Arenas, Sean Kirmani, Dmitry Kalashnikov, Dhruv Shah, and Ted Xiao. Steer: Flexible robotic manipulation via dense language grounding, 2024. URL https://arxiv.org/abs/2411.03409
[22] Jesse Zhang, Karl Pertsch, Jiahui Zhang, and Joseph J. Lim. Sprint: Scalable policy pre-training via language instruction relabeling, 2024. URL https://arxiv.org/abs/2306.11886

[23] Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time, 2022

[24] Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, and Ankit Goyal. Hamster: Hierarchical action models for open-world robot manipulation, 2025. URL https://arxiv.org/abs/2502.05485

[25] Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, and Ted Xiao. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. URL https://arxiv.org/abs/2311.01977

[27] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies, 2025. URL https://arxiv.org/abs/2412.10345

[28] Noriaki Hirose, Catherine Glossop, Dhruv Shah, and Sergey Levine. Omnivla: An omni-modal vision-language-action model for robot navigation, 2025. URL https://arxiv.org/abs/2509.19480

[29] Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URL https://arxiv.org/abs/2508.07917
[30]

Im- proving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jian- feng Wang, Linjie Li, LongOuyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhari- wal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Im- proving image generation with better captions. 2023

work page 2023
[31]

Integrated task and motion plan- ning, 2020

Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tom´as Lozano-P ´erez. Integrated task and motion plan- ning, 2020

work page 2020
[32]

Farrar, Straus and Giroux, 2011

Daniel Kahneman.Thinking, fast and slow. Farrar, Straus and Giroux, 2011

[33]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, and Mingxing Tan. Emma: End-to-end multimodal model for autonomous driving, 2024. URL https://arxiv.org/abs/2410.23262

[34]

Yell at your robot: Improving on-the-fly from language corrections, 2024

Lucy Xiaoyang Shi, Zheyuan Hu, Tony Z. Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn. Yell at your robot: Improving on-the-fly from language corrections, 2024

[35]

Interactive task planning with language models, 2025

Boyi Li, Philipp Wu, Pieter Abbeel, and Jitendra Malik. Interactive task planning with language models, 2025. URL https://arxiv.org/abs/2310.10645

[36]

Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action, 2022

Dhruv Shah, Blazej Osinski, Brian Ichter, and Sergey Levine. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action, 2022. URL https://arxiv.org/abs/2207.04429

[37]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022. URL https://arxiv.org/abs/2201.07207

[39]

Inner monologue: Embodied reasoning through planning with language models, 2022

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022

[40]

Socratic models: Composing zero-shot multimodal reasoning with language, 2022

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language, 2022

[41]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

[42]

Jesse Zhang, Jiahui Zhang, Karl Pertsch, Ziyi Liu, Xiang Ren, Minsuk Chang, Shao-Hua Sun, and Joseph J. Lim. Bootstrap your own skills: Learning to solve new tasks with large language model guidance, 2023. URL https://arxiv.org/abs/2310.10021

[43]

Scaling up and distilling down: Language-guided robot skill acquisition, 2023

Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition, 2023

[44]

Rt-h: Action hierarchies using language, 2024

Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language, 2024

[45]

Lohovla: A unified vision-language-action model for long-horizon embodied tasks, 2025

Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, and Zhijie Deng. Lohovla: A unified vision-language-action model for long-horizon embodied tasks, 2025. URL https://arxiv.org/abs/2506.00411

[46]

From code to action: Hierarchical learning of diffusion-vlm policies, 2025

Markus Peschl, Pietro Mazzaglia, and Daniel Dijkman. From code to action: Hierarchical learning of diffusion-vlm policies, 2025. URL https://arxiv.org/abs/2509.24917

[47]

Robix: A unified model for robot interaction, reasoning and planning, 2025

Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, and Hang Li. Robix: A unified model for robot interaction, reasoning and planning, 2025. URL https://arxiv.org/abs/2509.01106

[48]

Galaxea open-world dataset and g0 dual-system vla model, 2025

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model, 2025. URL https://arxiv.org/abs/2509.00576

[49]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yv...

[50]

Sam 2: Segment anything in images and videos, 2024

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxi...

[51]

End-to-end object detection with transformers, 2020

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers, 2020. URL https://arxiv.org/abs/2005.12872

[52]

Gemini: A family of highly capable multimodal models, 2024

Gemini Team. Gemini: A family of highly capable multimodal models, 2024

[53]

Tensorrt-openvla, 2025

William Chen, Michał Zawalski, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Tensorrt-openvla, 2025. URL https://github.com/rail-berkeley/tensorrt-openvla

[54]

Tensorrt-llm

NVIDIA. Tensorrt-llm. https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file, 2024

[55]

In-context imitation learning via next-token prediction. arXiv preprint arXiv:2408.15980, 2024

Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yunliang Chen, William Chung-Ho Panitch, Fangchen Liu, Hui Li, and Ken Goldberg. In-context imitation learning via next-token prediction. arXiv preprint arXiv:2408.15980, 2024

[56]

In-context learning enables robot action prediction in llms, 2025

Yida Yin, Zekai Wang, Yuvan Sharma, Dantong Niu, Trevor Darrell, and Roei Herzig. In-context learning enables robot action prediction in llms, 2025. URL https://arxiv.org/abs/2410.12782

[57]

Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Anni...

[58]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and...

[59]

Prismatic vlms: Investigating the design space of visually-conditioned language models, 2024

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models, 2024

[60]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matt...

[61]

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis, 2021. URL https://arxiv.org/abs/2105.05233

[62]

Classifier-free diffusion guidance, 2022

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022

[63]

Inference-time policy steering through human interactions, 2024

Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sundaralingam, Xuning Yang, Yu-Wei Chao, Claudia Perez-D’Arpino, Dieter Fox, and Julie Shah. Inference-time policy steering through human interactions, 2024

[64]

Steering your generalists: Improving robotic foundation models via value guidance. Conference on Robot Learning (CoRL), 2024

Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. Conference on Robot Learning (CoRL), 2024

[65]

Steering your diffusion policy with latent space reinforcement learning, 2025

Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning, 2025. URL https://arxiv.org/abs/2506.15799

[67]

Diffusion guidance is a controllable policy improvement operator, 2025

Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator, 2025. URL https://arxiv.org/abs/2505.23458

[68]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation, 2021. URL https://arxiv.org/abs/2101.00190

[69]

Robomonkey: Scaling test-time sampling and verification for vision-language-action models, 2025

Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models, 2025. URL https://arxiv.org/abs/2506.17811

[70]

Dynaguide: Steering diffusion policies with active dynamic guidance

Maximilian Du and Shuran Song. Dynaguide: Steering diffusion policies with active dynamic guidance. In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS), 2025

[71]

From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment, 2025

Yilin Wu, Ran Tian, Gokul Swamy, and Andrea Bajcsy. From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment, 2025. URL https://arxiv.org/abs/2502.01828

[72]

Pragmatic language interpretation as probabilistic inference

Noah D. Goodman and Michael C. Frank. Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11):818–829, 2016. ISSN 1364-6613. doi: https://doi.org/10.1016/j.tics.2016.08. URL https://www.sciencedirect.com/science/article/pii/S136466131630122X

[74]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310

[75]

Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization, 2025

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization, 2025. URL https://arxiv.org/abs/2510.03827

[76]

Llama 2: Open foundation and fine-tuned chat models, 2023

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Har...

[77]

Dinov2: Learning robust visual features without supervision, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, P...

[78]

Sigmoid loss for language image pre-training, 2023

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023

[79]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Le...

[80]

Knowledge insulating vision-language-action models: Train fast, run fast, generalize better

Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better,


Showing first 80 references.