Recognition: no theorem link
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
Pith reviewed 2026-05-15 22:02 UTC · model grok-4.3
The pith
Steerable Policies trained on multi-level synthetic commands let VLMs steer low-level robot actions more precisely and improve generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Steerable Policies are VLAs trained on rich synthetic commands at multiple abstraction levels, including subtasks, motions, and grounded pixel coordinates. This training produces a low-level policy that high-level VLMs or learned reasoners can steer through these explicit command abstractions. When the resulting system is tested on real-world manipulation, both the learned-reasoner variant and the prompted-VLM variant outperform prior embodied-reasoning VLAs and VLM-based hierarchical baselines, with the largest gains on tasks that require generalization or long horizons.
What carries the argument
Steerable Policies: VLAs trained on synthetic multi-level commands (subtasks, motions, grounded pixel coordinates) that serve as a controllable interface for high-level VLMs to steer low-level robot behavior.
If this is right
- Steerable Policies controlled by a learned high-level embodied reasoner outperform prior methods on manipulation tasks.
- Off-the-shelf VLMs prompted to reason over command abstractions via in-context learning can also steer Steerable Policies effectively.
- The approach yields larger gains on challenging generalization and long-horizon tasks than on simple ones.
- Improved low-level controllability allows pretrained VLM knowledge to transfer more successfully into robot behavior.
Where Pith is reading between the lines
- The same multi-level command training could be applied to other hierarchical robot systems that currently rely on natural language interfaces.
- If synthetic command training scales, it may reduce dependence on large amounts of real-world robot data for low-level policy learning.
- The method invites testing whether different VLM sizes or architectures benefit unequally from the added controllability.
Load-bearing premise
Training on synthetic multi-level commands transfers to real robot execution without a large domain gap, and the richer command set actually lets VLMs steer behavior in ways that improve generalization beyond natural language alone.
What would settle it
Real-world trials in which the Steerable Policy controlled by a VLM shows no improvement or worse performance than a standard VLA using only natural language commands on held-out generalization tasks would falsify the central claim.
Figures
read the original abstract
Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Steerable Policies as VLAs trained on rich synthetic commands at multiple abstraction levels (subtasks, motions, grounded pixel coordinates) to enhance low-level controllability. This richer interface is claimed to unlock pretrained VLM knowledge for better embodied reasoning and hierarchical control. The work evaluates two control methods—a learned high-level embodied reasoner and an off-the-shelf VLM using in-context learning over command abstractions—and reports that both outperform prior VLAs and VLM-based hierarchical baselines on real-world manipulation tasks, including generalization and long-horizon scenarios.
Significance. If the empirical claims hold with proper validation, the result would be significant for embodied AI by demonstrating a practical way to bridge high-level VLM reasoning with low-level robot control via synthetic multi-level commands, potentially improving generalization without heavy real-world fine-tuning. The use of both learned and prompted VLM controllers is a notable strength, as is the focus on real hardware evaluation.
major comments (2)
- [Abstract] Abstract: The central claim that Steerable Policies unlock VLM knowledge and yield outperformance on real-world experiments is stated without any quantitative metrics, success rates, error bars, data splits, baseline details, or statistical significance. This is load-bearing because the reported gains could arise solely from low-level policy improvements rather than the richer command interface enabling better VLM steering.
- [Experiments] Experiments section: No evidence is provided for command-following success rates on held-out real trajectories or ablations isolating the effect of command richness (e.g., multi-level vs. natural language only). This directly undermines validation of the weakest assumption that synthetic training transfers without domain gap and that the additional channels transmit useful VLM reasoning.
minor comments (1)
- [Abstract] Abstract: The provided website link is useful but the summary text does not reference specific figures, tables, or sections containing the quantitative results needed to support the claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the presentation of quantitative evidence and experimental validation. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that Steerable Policies unlock VLM knowledge and yield outperformance on real-world experiments is stated without any quantitative metrics, success rates, error bars, data splits, baseline details, or statistical significance. This is load-bearing because the reported gains could arise solely from low-level policy improvements rather than the richer command interface enabling better VLM steering.
Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we will incorporate key success rates from the real-world experiments (with error bars), baseline comparisons, and a brief note on data splits and statistical testing. We will also explicitly reference the ablations in the experiments section that isolate the contribution of the multi-level command interface, thereby clarifying that gains are not attributable solely to low-level policy improvements. revision: yes
-
Referee: [Experiments] Experiments section: No evidence is provided for command-following success rates on held-out real trajectories or ablations isolating the effect of command richness (e.g., multi-level vs. natural language only). This directly undermines validation of the weakest assumption that synthetic training transfers without domain gap and that the additional channels transmit useful VLM reasoning.
Authors: We will add a new subsection in the experiments that reports command-following success rates on held-out real-world trajectories. We will also include explicit ablations that compare multi-level synthetic commands against natural-language-only interfaces, directly measuring the incremental benefit of command richness. These additions will provide the requested evidence for synthetic-to-real transfer and the utility of the additional steering channels. revision: yes
Circularity Check
No circularity: empirical method with external real-world validation
full rationale
The paper proposes training VLAs on synthetic multi-level commands (subtasks, motions, pixel coordinates) and evaluates the resulting steerable policies via real-robot experiments with both learned reasoners and prompted VLMs. No mathematical derivation chain exists; claims rest on empirical outperformance rather than any equation or parameter that reduces to its own inputs by construction. No self-citation is invoked as a uniqueness theorem or load-bearing premise. The central benefit (improved VLM steering via richer interfaces) is tested against baselines on held-out real tasks, satisfying the requirement for independent external evidence.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
Reference graph
Works this paper leans on
-
[1]
Do as i can, not as i say: Grounding language in robotic affordances, 2022
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Haus- man, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang,...
work page 2022
-
[2]
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, and Chelsea Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models, 2025. URL https://arxiv.org/abs/2502.19417
work page internal anchor Pith review arXiv 2025
-
[3]
Limited Linguistic Diversity in Embodied AI Datasets
Selma Wanna, Agnes Luhtaru, Jonathan Salfity, Ryan Barron, Juston Moore, Cynthia Matuszek, and Mitch Pryor. Limited linguistic diversity in embodied ai datasets, 2026. URL https://arxiv.org/abs/2601.03136
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[4]
Bridge data: Boosting generalization of robotic skills with cross- domain datasets, 2021
Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Dani- ilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross- domain datasets, 2021
work page 2021
-
[5]
Bridgedata v2: A dataset for robot learning at scale
Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen- Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning (CoRL), 2023
work page 2023
-
[6]
Chain-of-thought prompting elicits reason- ing in large language models, 2023
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reason- ing in large language models, 2023
work page 2023
-
[7]
Large language models are zero-shot reasoners, 2023
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- taka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023
work page 2023
-
[8]
Robotic control via embodied chain-of-thought reasoning
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. InConference on Robot Learning, 2024
work page 2024
-
[9]
Training strategies for efficient embodied reasoning, 2025
William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning, 2025
work page 2025
-
[10]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. 2024
work page 2024
-
[11]
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pert...
work page 2025
-
[12]
Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stef...
work page 2022
-
[13]
Rt-2: Vision-language-action models transfer web knowl- edge to robotic control, 2023
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Flo- rence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexan- der Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashn...
work page 2023
-
[14]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tok- enization for vision-language-action models, 2025. URL https://arxiv.org/abs/2501.09747
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Minivla: A better vla with a smaller footprint, 2024
Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://ai.stanford. edu/blog/minivla/
work page 2024
-
[16]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π 0: A vi...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Fine- tuning vision-language-action models: Optimizing speed and success, 2025
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine- tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502. 19645
work page 2025
-
[18]
Gemini Robotics: Bringing AI into the Physical World
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, Steven Bo- hez, Konstantinos Bousmalis, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Oscar Chang, Jose En- ri...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Cast: Counterfactual labels improve instruction following in vision-language-action models, 2025
Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, and Sergey Levine. Cast: Counterfactual labels improve instruction following in vision-language-action models, 2025. URL https://arxiv.org/abs/2508.13446
-
[20]
Robotic skill acquisition via instruc- tion augmentation with vision-language models, 2022
Ted Xiao, Harris Chan, Pierre Sermanet, Ayzaan Wahid, Anthony Brohan, Karol Hausman, Sergey Levine, and Jonathan Tompson. Robotic skill acquisition via instruc- tion augmentation with vision-language models, 2022
work page 2022
-
[21]
Steer: Flexible robotic manipulation via dense language grounding, 2024
Laura Smith, Alex Irpan, Montserrat Gonzalez Arenas, Sean Kirmani, Dmitry Kalashnikov, Dhruv Shah, and Ted Xiao. Steer: Flexible robotic manipulation via dense language grounding, 2024. URL https://arxiv.org/abs/ 2411.03409
- [22]
-
[23]
Interactive language: Talking to robots in real time, 2022
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time, 2022
work page 2022
-
[24]
R., Ramos, F., Fox, D., Li, A., Gupta, A., and Goyal, A
Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, and Ankit Goyal. Hamster: Hierarchical action models for open- world robot manipulation, 2025. URL https://arxiv.org/ abs/2502.05485
-
[25]
Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,
Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, and Ted Xiao. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,
- [26]
-
[27]
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum ´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting en- hances spatial-temporal awareness for generalist robotic policies, 2025. URL https://arxiv.org/abs/2412.10345
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Omnivla: An omni-modal vision- language-action model for robot navigation, 2025
Noriaki Hirose, Catherine Glossop, Dhruv Shah, and Sergey Levine. Omnivla: An omni-modal vision- language-action model for robot navigation, 2025. URL https://arxiv.org/abs/2509.19480
-
[29]
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, An- gelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URL https://arxiv.org/abs/2508.07917
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Im- proving image generation with better captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jian- feng Wang, Linjie Li, LongOuyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhari- wal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Im- proving image generation with better captions. 2023
work page 2023
-
[31]
Integrated task and motion plan- ning, 2020
Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tom´as Lozano-P ´erez. Integrated task and motion plan- ning, 2020
work page 2020
-
[32]
Farrar, Straus and Giroux, 2011
Daniel Kahneman.Thinking, fast and slow. Farrar, Straus and Giroux, 2011
work page 2011
-
[33]
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, and Mingxing Tan. Emma: End- to-end multimodal model for autonomous driving, 2024. URL https://arxiv.org/abs/2410.23262
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn
Lucy Xiaoyang Shi, Zheyuan Hu, Tony Z. Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn. Yell at your robot: Improving on-the-fly from language corrections, 2024
work page 2024
-
[35]
Interactive task planning with language models, 2025 a
Boyi Li, Philipp Wu, Pieter Abbeel, and Jitendra Malik. Interactive task planning with language models, 2025. URL https://arxiv.org/abs/2310.10645
-
[36]
Lm-nav: Robotic navigation with large pre- trained models of language, vision, and action, 2022
Dhruv Shah, Blazej Osinski, Brian Ichter, and Sergey Levine. Lm-nav: Robotic navigation with large pre- trained models of language, vision, and action, 2022. URL https://arxiv.org/abs/2207.04429
-
[37]
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,
- [38]
-
[39]
Inner monologue: Embodied reasoning through planning with language models, 2022
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022
work page 2022
-
[40]
Socratic models: Composing zero-shot multimodal reasoning with language, 2022
Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sind- hwani, Johnny Lee, Vincent Vanhoucke, and Pete Flo- rence. Socratic models: Composing zero-shot multimodal reasoning with language, 2022
work page 2022
-
[41]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...
work page 2023
- [42]
-
[43]
Scaling up and distilling down: Language-guided robot skill acquisition, 2023
Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition, 2023
work page 2023
-
[44]
Rt-h: Action hierar- chies using language, 2024
Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, De- bidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierar- chies using language, 2024
work page 2024
-
[45]
Lohovla: A unified vision-language-action model for long-horizon embodied tasks, 2025
Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, and Zhijie Deng. Lohovla: A unified vision-language-action model for long-horizon embodied tasks, 2025. URL https://arxiv.org/abs/2506.00411
-
[46]
From code to action: Hierarchical learning of diffusion- vlm policies, 2025
Markus Peschl, Pietro Mazzaglia, and Daniel Dijkman. From code to action: Hierarchical learning of diffusion- vlm policies, 2025. URL https://arxiv.org/abs/2509. 24917
work page 2025
-
[47]
Robix: A unified model for robot interaction, reasoning and planning, 2025
Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, and Hang Li. Robix: A unified model for robot interaction, reasoning and planning, 2025. URL https://arxiv.org/abs/ 2509.01106
-
[48]
Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576, 2025
Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model, 2025. URL https://arxiv.org/abs/2509.00576
-
[49]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tri- pathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yv...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Sam 2: Segment anything in images and videos, 2024
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Rong- hang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R ¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Doll ´ar, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxi...
work page 2024
-
[51]
End-to-end object detection with transform- ers, 2020
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transform- ers, 2020. URL https://arxiv.org/abs/2005.12872
-
[52]
Gemini: A family of highly capable multimodal models, 2024
Gemini Team. Gemini: A family of highly capable multimodal models, 2024
work page 2024
-
[53]
William Chen, Michał Zawalski, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Tensorrt- openvla, 2025. URL https://github.com/rail-berkeley/ tensorrt-openvla
work page 2025
-
[54]
NVIDIA. Tensorrt-llm. https://github.com/NVIDIA/ TensorRT-LLM?tab=readme-ov-file, 2024
work page 2024
-
[55]
In-context imitation learning via next-token prediction.arXiv preprint arXiv:2408.15980, 2024
Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yun- liang Chen, William Chung-Ho Panitch, Fangchen Liu, Hui Li, and Ken Goldberg. In-context imitation learning via next-token prediction.arXiv preprint arXiv:2408.15980, 2024
-
[56]
In-context learning enables robot action prediction in llms, 2025
Yida Yin, Zekai Wang, Yuvan Sharma, Dantong Niu, Trevor Darrell, and Roei Herzig. In-context learning enables robot action prediction in llms, 2025. URL https://arxiv.org/abs/2410.12782
-
[57]
Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Anni...
work page 2024
-
[58]
Octo: An open-source generalist robot policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and...
work page 2024
-
[59]
Pris- matic vlms: Investigating the design space of visually- conditioned language models, 2024
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Pris- matic vlms: Investigating the design space of visually- conditioned language models, 2024
work page 2024
-
[60]
PaliGemma: A versatile 3B VLM for transfer
Lucas Beyer, Andreas Steiner, Andr ´e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschan- nen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Gryc- ner, Alexey Gritsenko, Neil Houlsby, Manoj Ku- mar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matt...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[61]
Diffusion Models Beat GANs on Image Synthesis
Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis, 2021. URL https://arxiv. org/abs/2105.05233
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[62]
Classifier-free diffusion guidance, 2022
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022
work page 2022
-
[63]
Inference-time policy steering through human interactions, 2024
Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sun- daralingam, Xuning Yang, Yu-Wei Chao, Claudia Perez- D’Arpino, Dieter Fox, and Julie Shah. Inference-time policy steering through human interactions, 2024
work page 2024
-
[64]
Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance.Confer- ence on Robot Learning (CoRL), 2024
work page 2024
-
[65]
Steering your diffusion policy with latent space reinforcement learning,
Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Naga- bandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning,
- [66]
-
[67]
Diffusion guidance is a controllable policy im- provement operator.arXiv preprint arXiv:2505.23458,
Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator, 2025. URL https://arxiv.org/abs/ 2505.23458
-
[68]
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Xiang Lisa Li and Percy Liang. Prefix-tuning: Opti- mizing continuous prompts for generation, 2021. URL https://arxiv.org/abs/2101.00190
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[69]
Robomonkey: Scaling test-time sampling and verification for vision-language-action models, 2025
Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models, 2025. URL https://arxiv.org/abs/2506.17811
-
[70]
Dynaguide: Steering diffusion policies with active dynamic guidance
Maximilian Du and Shuran Song. Dynaguide: Steering diffusion policies with active dynamic guidance. InPro- ceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS), 2025
work page 2025
-
[71]
From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment, 2025
Yilin Wu, Ran Tian, Gokul Swamy, and Andrea Bajcsy. From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment, 2025. URL https://arxiv. org/abs/2502.01828
-
[72]
Noah D. Goodman and Michael C. Frank. Pragmatic language interpretation as probabilistic inference.Trends in Cognitive Sciences, 20(11):818–829, 2016. ISSN 1364-6613. doi: https://doi.org/10.1016/j.tics.2016.08
-
[73]
URL https://www.sciencedirect.com/science/article/ pii/S136466131630122X
-
[74]
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[75]
Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. Libero-pro: Towards robust and fair eval- uation of vision-language-action models beyond memo- rization, 2025. URL https://arxiv.org/abs/2510.03827
-
[76]
Llama 2: Open foundation and fine-tuned chat models, 2023
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernan- des, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Har...
work page 2023
-
[77]
Di- nov2: Learning robust visual features without supervi- sion, 2024
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fer- nandez, Daniel Haziza, Francisco Massa, Alaaeldin El- Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Jegou, Julien Mairal, P...
work page 2024
-
[78]
Sigmoid loss for language image pre- training, 2023
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre- training, 2023
work page 2023
-
[79]
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abra- ham Le...
work page 2024
-
[80]
Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine
Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine. Knowledge insulating vision-language- action models: Train fast, run fast, generalize better,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.