pith. machine review for the scientific record.

arxiv: 2503.10631 · v3 · submitted 2025-03-13 · 💻 cs.CV · cs.RO

Recognition: no theorem link

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 21:57 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords HybridVLA · vision-language-action · diffusion · autoregression · robot manipulation · collaborative training · action ensemble

The pith

HybridVLA unifies diffusion for continuous actions and autoregression for reasoning inside one vision-language-action model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a single model that combines the strengths of diffusion-based methods for smooth continuous control with autoregressive vision-language models for contextual reasoning. Previous approaches either quantized actions into discrete bins, which disrupts continuity, or bolted a diffusion head onto VLM features without full integration. HybridVLA integrates both paradigms through a collaborative training recipe in which diffusion denoising is incorporated into next-token prediction, letting the two methods reinforce each other; an adaptive ensemble then fuses their predictions into the final action. The result is improved success rates on manipulation tasks in both simulation and real-world settings.
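
To make the shape of such a collaborative step concrete, here is a minimal sketch, assuming hypothetical names (collaborative_step, a backbone returning both token logits and a denoising prediction) and a standard epsilon-prediction diffusion objective; the paper's actual token layout, timestep conditioning, and loss weighting may differ.

```python
# Hedged sketch (not the authors' code): one training step that computes a diffusion
# denoising loss on continuous actions and a next-token cross-entropy loss on
# discretized action tokens from a single pass through the same LLM backbone.
import torch
import torch.nn.functional as F

def collaborative_step(backbone, vl_tokens, action_gt, action_token_ids,
                       alphas_cumprod, lambda_diff=1.0):
    """backbone(vl_tokens, noisy_action, t, action_token_ids) is assumed to
    return (token_logits, denoise_pred) from one forward pass."""
    batch = action_gt.shape[0]
    # Diffusion branch: corrupt the continuous action at a random timestep t.
    t = torch.randint(0, alphas_cumprod.shape[0], (batch,), device=action_gt.device)
    a_bar = alphas_cumprod[t].view(batch, 1)
    noise = torch.randn_like(action_gt)
    noisy_action = a_bar.sqrt() * action_gt + (1.0 - a_bar).sqrt() * noise

    # Single pass over [vision-language tokens | noisy action | discrete action tokens].
    token_logits, denoise_pred = backbone(vl_tokens, noisy_action, t, action_token_ids)

    # Autoregressive branch: next-token cross-entropy on the discretized action tokens
    # (targets assumed already shifted for next-token prediction).
    ar_loss = F.cross_entropy(token_logits.flatten(0, 1), action_token_ids.flatten())
    # Diffusion branch: epsilon-prediction objective.
    diff_loss = F.mse_loss(denoise_pred, noise)
    return ar_loss + lambda_diff * diff_loss
```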

Core claim

HybridVLA is a unified framework within a single large language model that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression. A collaborative training recipe seamlessly incorporates diffusion denoising into the next-token prediction process, enabling the two action prediction methods to reinforce each other. A collaborative action ensemble mechanism then adaptively fuses both predictions for more robust control.

What carries the argument

The collaborative training recipe that incorporates diffusion denoising into next-token prediction, allowing the two paradigms to reinforce each other, followed by an adaptive ensemble of their predictions.
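
A minimal sketch of what such an adaptive fusion could look like, assuming the autoregressive head's mean token confidence drives the weighting; the gating rule and threshold tau are illustrative assumptions, not the paper's mechanism.

```python
# Hedged sketch of an adaptive action ensemble: fuse the diffusion head's continuous
# action with the de-tokenized autoregressive action, weighted by AR confidence.
import torch

def ensemble_actions(diff_action, ar_action, ar_token_logprobs, tau=0.5):
    """diff_action, ar_action: (B, action_dim); ar_token_logprobs: (B, num_action_tokens)."""
    # Mean per-token probability of the decoded action tokens, in [0, 1].
    confidence = ar_token_logprobs.exp().mean(dim=-1, keepdim=True)
    # Gate out low-confidence autoregressive predictions entirely.
    weight = torch.where(confidence > tau, confidence, torch.zeros_like(confidence))
    return weight * ar_action + (1.0 - weight) * diff_action
```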

If this is right

  • Mean success rates improve by 14% in simulation and 19% in real-world tasks relative to prior VLA methods.
  • The model demonstrates stable manipulation in unseen configurations, owing to the combination of continuous action generation and contextual reasoning.
  • Action prediction methods vary in strength across tasks, making the ensemble beneficial for robustness.
  • The unified approach fully leverages pretrained VLM reasoning through token-level generation alongside continuous action output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar unification could be applied to other generation tasks combining discrete and continuous outputs.
  • Future work might explore scaling this to larger models or more complex environments.
  • The adaptive ensemble suggests potential for dynamic weighting based on task type.

Load-bearing premise

The collaborative training recipe successfully prevents interference between diffusion denoising and next-token prediction while allowing the two to reinforce each other across tasks.

What would settle it

An ablation study showing that removing the collaborative training or the ensemble causes no degradation, or that the hybrid offers no improvement over separate diffusion-only and autoregressive-only baselines, would falsify the central claim.
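
A sketch of the comparison such an ablation would produce; condition names are assumptions and the rate lists are placeholders to be filled with per-task success rates measured under identical conditions, not reported results.

```python
# Hypothetical ablation scaffold: mean success rate per condition and delta vs. full model.
def mean_success(rates):
    return sum(rates) / len(rates)

conditions = {
    "hybrid_full": [],               # collaborative training + adaptive ensemble
    "diffusion_head_only": [],       # no autoregressive action tokens
    "autoregressive_head_only": [],  # no diffusion denoising
    "fixed_weight_fusion": [],       # both heads, non-adaptive ensemble
}

full = mean_success(conditions["hybrid_full"]) if conditions["hybrid_full"] else None
for name, rates in conditions.items():
    if rates and full is not None:
        print(f"{name}: {mean_success(rates):.3f} (delta vs full: {mean_success(rates) - full:+.3f})")
```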

read the original abstract

A fundamental objective of manipulation policy design is to endow robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments. Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction. However, these methods quantize actions into discrete bins, which disrupts the continuity required for precise control. In contrast, existing diffusion-based VLA methods incorporate an additional diffusion head to predict continuous actions solely conditioned on feature representations extracted by the VLM, without fully leveraging the VLM's pretrained reasoning capabilities through token-level generation. To address these limitations, we introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression within a single large language model. To mitigate interference between the two generation paradigms, we propose a collaborative training recipe that seamlessly incorporates diffusion denoising into the next-token prediction process. With this recipe, we find these two action prediction methods not only reinforce each other but also exhibit varying strength across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses both predictions, leading to more robust control. HybridVLA outperforms previous state-of-the-art VLA methods by 14% and 19% in mean success rate on simulation and real-world tasks, respectively, while demonstrating stable manipulation in unseen configurations.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HybridVLA, a unified vision-language-action (VLA) model that integrates diffusion-based continuous action generation with autoregressive next-token prediction inside a single large language model. It proposes a collaborative training recipe to fold diffusion denoising into the token-prediction objective and an adaptive ensemble mechanism that fuses the two outputs, claiming 14% and 19% gains in mean success rate over prior SOTA VLA methods on simulation and real-world manipulation tasks, respectively, together with improved robustness on unseen configurations.

Significance. If the reported gains can be shown to arise from the claimed synergy rather than capacity or training artifacts, the work would offer a practical way to combine the reasoning strengths of pretrained VLMs with the continuity of diffusion policies, potentially advancing unified VLA architectures for precise robotic control.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the central claim that the collaborative training recipe and ensemble produce mutual reinforcement (rather than interference or simple capacity increase) rests on the 14%/19% gains, yet no ablation results are supplied for diffusion-head-only, autoregressive-head-only, or non-adaptive fusion baselines on the same tasks and metrics. Without these comparisons the attribution of performance to the hybrid design cannot be verified.
  2. [Abstract] Abstract: the statement that the two paradigms 'reinforce each other' and 'exhibit varying strength across different tasks' is presented without supporting quantitative evidence (e.g., per-task success rates for each head or correlation between head outputs), leaving the motivation for the adaptive ensemble ungrounded.
minor comments (2)
  1. [Method] The description of the collaborative training recipe would benefit from an explicit loss equation or training-schedule diagram showing how diffusion denoising steps are interleaved with next-token prediction (a hedged sketch of one possible combined objective follows this list).
  2. [Experiments] Table or figure captions should explicitly list the exact simulation and real-world benchmarks, number of trials, and statistical significance tests used to support the 14% and 19% mean-success-rate deltas.
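
For minor comment 1, one plausible form of the requested loss equation, written as a hedged sketch with assumed notation (not taken from the paper), would combine the next-token cross-entropy with an epsilon-prediction denoising term:

```latex
% Hedged sketch of a combined objective (notation assumed, not the paper's):
% L_AR is next-token cross-entropy over discretized action tokens,
% L_diff the epsilon-prediction denoising loss on the continuous action.
\mathcal{L}_{\text{hybrid}}
  = \underbrace{-\sum_{i} \log p_\theta\!\left(a_i \mid a_{<i}, v, \ell\right)}_{\mathcal{L}_{\text{AR}}}
  + \lambda \,
    \underbrace{\mathbb{E}_{t,\,\epsilon \sim \mathcal{N}(0, I)}
      \big\| \epsilon - \epsilon_\theta\!\left(a_t, t, v, \ell\right) \big\|_2^2}_{\mathcal{L}_{\text{diff}}}
```

Here v denotes the visual observation, ℓ the language instruction, a_i the discretized action tokens, a_t the noised continuous action at diffusion step t, and λ a balancing weight; all of these symbols are assumptions for illustration.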

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that stronger empirical support is required to attribute the reported gains specifically to the collaborative training and adaptive ensemble rather than capacity or training effects. We will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim that the collaborative training recipe and ensemble produce mutual reinforcement (rather than interference or simple capacity increase) rests on the 14%/19% gains, yet no ablation results are supplied for diffusion-head-only, autoregressive-head-only, or non-adaptive fusion baselines on the same tasks and metrics. Without these comparisons the attribution of performance to the hybrid design cannot be verified.

    Authors: We agree that the current manuscript lacks the requested internal ablations. In the revised version we will add results for (i) diffusion-head-only, (ii) autoregressive-head-only, and (iii) non-adaptive (e.g., fixed-weight) fusion baselines, all trained and evaluated under identical conditions on the same simulation and real-world tasks. These comparisons will clarify whether the observed improvements stem from the proposed collaborative recipe and adaptive ensemble. revision: yes

  2. Referee: [Abstract] Abstract: the statement that the two paradigms 'reinforce each other' and 'exhibit varying strength across different tasks' is presented without supporting quantitative evidence (e.g., per-task success rates for each head or correlation between head outputs), leaving the motivation for the adaptive ensemble ungrounded.

    Authors: We acknowledge the absence of per-task breakdowns and correlation analysis in the submitted version. The revised manuscript will include (a) per-task success rates for each head individually and (b) quantitative measures of output agreement or complementarity between the two heads. These additions will directly support the claim of varying strengths and the rationale for adaptive fusion. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents HybridVLA as an empirical architecture combining diffusion and autoregressive heads within a VLM backbone, with a collaborative training recipe and adaptive ensemble as design choices. No equations, self-definitional reductions, or load-bearing self-citations appear that would make the reported 14%/19% success-rate gains equivalent to fitted inputs or prior results by construction. Performance claims are grounded in external task benchmarks rather than internal redefinitions, so the argument does not turn on circular definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the unstated premise that diffusion and autoregressive heads can be trained jointly inside one LLM without destructive interference and that an adaptive ensemble will reliably select the stronger prediction per task. The abstract explicitly quantifies no free parameters or axioms; two invented entities are listed below.

invented entities (2)
  • Collaborative training recipe · no independent evidence
    purpose: Seamlessly incorporate diffusion denoising into next-token prediction
    Introduced to mitigate interference between paradigms
  • Collaborative action ensemble mechanism · no independent evidence
    purpose: Adaptively fuse diffusion and autoregressive predictions
    Designed to exploit varying strengths across tasks

pith-pipeline@v0.9.0 · 5604 in / 1133 out tokens · 36807 ms · 2026-05-15T21:57:10.964919+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Test-time Sparsity for Extreme Fast Action Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 7.0

    Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

  4. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  5. Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

  6. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  7. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  8. AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.

  9. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  10. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  11. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  12. dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.

  13. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  14. Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.

  15. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  16. Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning

    cs.AI 2026-05 unverdicted novelty 5.0

    A generative framework using geometric diffusion for brain networks and tabular diffusion for other organs integrates ICD-coded SDoH proxies to improve disease reasoning on UK Biobank data.

  17. Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

    cs.RO 2026-04 unverdicted novelty 5.0

    A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.

  18. Adaptor: Advancing Assistive Teleoperation with Few-Shot Learning and Cross-Operator Generalization

    cs.RO 2026-04 unverdicted novelty 5.0

    Adaptor uses few-shot learning with trajectory perturbation and vision-language conditioning to achieve robust cross-operator intent recognition and higher success rates in assistive teleoperation.

  19. Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

    cs.RO 2026-03 unverdicted novelty 5.0

    Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervise...

  20. Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

    cs.RO 2026-04 unverdicted novelty 4.0

    A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.

  21. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

127 extracted references · 127 canonical work pages · cited by 19 Pith papers · 37 internal anchors

  1. [1]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  2. [2]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  3. [3]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023

  4. [4]

    Towards generalist robot policies: What matters in building vision-language-action models

    Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024

  5. [5]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

  6. [6]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  7. [7]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  8. [8]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023

  9. [9]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  10. [10]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  11. [11]

    Manipllm: Embodied multimodal large language model for object-centric robotic manipulation

    Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024

  12. [12]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  13. [13]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  14. [14]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  15. [15]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  16. [16]

    Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression

    Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, et al. Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression. arXiv preprint arXiv:2412.03293, 2024

  17. [17]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  18. [18]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022

  19. [19]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  20. [20]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  21. [21]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

  22. [22]

    3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy. arXiv preprint arXiv:2403.03954, 2024

  23. [23]

    3d diffuser actor: Policy diffusion with 3d scene representations

    Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. arXiv preprint arXiv:2402.10885, 2024

  24. [24]

    Imitating human behaviour with diffusion models

    Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models. In The Eleventh International Conference on Learning Representations

  25. [25]

    Goal-conditioned imitation learning using score-based diffusion policies

    Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023

  26. [26]

    ChainedDiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation

    Zhou Xian, Nikolaos Gkanatsios, Theophile Gervet, Tsung-Wei Ke, and Katerina Fragkiadaki. Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In 7th Annual Conference on Robot Learning, 2023

  27. [27]

    Prismatic vlms: Investigating the design space of visually-conditioned language models

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865, 2024

  28. [28]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023

  29. [29]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Y...

  30. [30]

    RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024

  31. [31]

    Rlbench: The robot learning benchmark & learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  32. [32]

    Learning dexterous in-hand manipulation

    OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020

  33. [33]

    End-to-end affordance learning for robotic manipulation

    Yiran Geng, Boshi An, Haoran Geng, Yuanpei Chen, Yaodong Yang, and Hao Dong. End-to-end affordance learning for robotic manipulation. In International Conference on Robotics and Automation (ICRA), 2023

  34. [34]

    Robotic grasping using deep reinforcement learning

    Shirin Joshi, Sulabh Kumra, and Ferat Sahin. Robotic grasping using deep reinforcement learning. In 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), pages 1461–1466. IEEE, 2020

  35. [35]

    Mastering visual continuous control: Improved data-augmented reinforcement learning

    Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021

  36. [36]

    Behavioral Cloning from Observation

    Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018

  37. [37]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

  38. [38]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  39. [39]

    Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. ICLR 2025 Spotlight, 2024

  40. [40]

    Crayonrobo: Toward generic robot manipulation via crayon visual prompting

    Xiaoqi Li, Lingyun Xu, Jiaming Liu, Mingxu Zhang, Jiahui Xu, Siyuan Huang, Iaroslav Ponomarenko, Yan Shen, Shanghang Zhang, and Hao Dong. Crayonrobo: Toward generic robot manipulation via crayon visual prompting

  41. [41]

    Autonomous interactive correction mllm for robust robotic manipulation

    Chuyan Xiong, Chengyu Shen, Xiaoqi Li, Kaichen Zhou, Jiaming Liu, Ruiping Wang, and Hao Dong. Autonomous interactive correction mllm for robust robotic manipulation. In 8th Annual Conference on Robot Learning, 2024

  42. [42]

    Naturalvlm: Leveraging fine-grained natural language for affordance-guided visual manipulation

    Ran Xu, Yan Shen, Xiaoqi Li, Ruihai Wu, and Hao Dong. Naturalvlm: Leveraging fine-grained natural language for affordance-guided visual manipulation. arXiv preprint arXiv:2403.08355, 2024

  43. [43]

    Self-corrected multimodal large language model for end-to-end robot manipulation

    Jiaming Liu, Chenxuan Li, Guanqun Wang, Lily Lee, Kaichen Zhou, Sixiang Chen, Chuyan Xiong, Jiaxin Ge, Renrui Zhang, and Shanghang Zhang. Self-corrected multimodal large language model for end-to-end robot manipulation. arXiv preprint arXiv:2405.17418, 2024

  44. [44]

    Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation

    Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024

  45. [45]

    Robomamba: Multimodal state space model for efficient robot reasoning and manipulation

    Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Lily Lee, Kaichen Zhou, Pengju An, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Multimodal state space model for efficient robot reasoning and manipulation. arXiv preprint arXiv:2406.04339, 2024

  46. [46]

    Manipvqa: Injecting robotic affordance and physically grounded information into multi-modal large language models

    Siyuan Huang, Iaroslav Ponomarenko, Zhengkai Jiang, Xiaoqi Li, Xiaobin Hu, Peng Gao, Hongsheng Li, and Hao Dong. Manipvqa: Injecting robotic affordance and physically grounded information into multi-modal large language models. arXiv preprint arXiv:2403.11289, 2024

  47. [47]

    Vision-language foundation models as effective robot imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023

  48. [48]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023

  49. [49]

    Long short-term memory

    Alex Graves. Long short-term memory. Supervised sequence labelling with recurrent neural networks, pages 37–45, 2012

  50. [50]

    Is Conditional Generative Modeling all you need for Decision-Making?

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022

  51. [51]

    Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022

  52. [52]

    Imitating human behaviour with diffusion models

    Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models. arXiv preprint arXiv:2301.10677, 2023

  53. [53]

    Consistency policy: Accelerated visuomotor policies via consistency distillation

    Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation. arXiv preprint arXiv:2405.07503, 2024

  54. [54]

    Shelving, stacking, hanging: Relational pose diffusion for multi-modal rearrangement

    Anthony Simeonov, Ankit Goyal, Lucas Manuelli, Lin Yen-Chen, Alina Sarmiento, Alberto Rodriguez, Pulkit Agrawal, and Dieter Fox. Shelving, stacking, hanging: Relational pose diffusion for multi-modal rearrangement. arXiv preprint arXiv:2307.04751, 2023

  55. [55]

    SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion

    Julen Urain, Niklas Funk, Jan Peters, and Georgia Chalvatzaki. Se(3)-diffusionfields: Learning smooth cost functions for joint grasp and motion optimization through diffusion. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5923–5930. IEEE, 2023

  56. [56]

    Learning score-based grasping primitive for human-assisting dexterous grasping

    Tianhao Wu, Mingdong Wu, Jiyao Zhang, Yunchong Gan, and Hao Dong. Learning score-based grasping primitive for human-assisting dexterous grasping. Advances in Neural Information Processing Systems, 36:22132–22150, 2023

  57. [57]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022

  58. [58]

    Edmp: Ensemble-of-costs-guided diffusion for motion planning

    Kallol Saha, Vishal Mandadi, Jayaram Reddy, Ajit Srikanth, Aditya Agarwal, Bipasha Sen, Arun Singh, and Madhava Krishna. Edmp: Ensemble-of-costs-guided diffusion for motion planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 10351–10358. IEEE, 2024

  59. [59]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  60. [60]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  61. [61]

    Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024

  62. [62]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

  63. [63]

    Helix: A vision-language-action model for generalist humanoid control

    figureai. Helix: A vision-language-action model for generalist humanoid control. https://www.figure.ai/news/helix. Accessed 2025.5.7

  64. [64]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

  65. [65]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024

  66. [66]

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024

  67. [67]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  68. [68]

    Monoformer: One transformer for both diffusion and autoregression

    Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, and Jingdong Wang. Monoformer: One transformer for both diffusion and autoregression. arXiv preprint arXiv:2409.16280, 2024

  69. [69]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  70. [70]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024

  71. [71]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  72. [72]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  73. [73]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023

  74. [74]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  75. [75]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  76. [76]

    Phi-2: The surprising power of small language models

    Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. Phi-2: The surprising power of small language models. Microsoft Research Blog, 1(3):3, 2023

  77. [77]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  78. [78]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  79. [79]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022

  80. [80]

    The open motion planning library

    Ioan A Sucan, Mark Moll, and Lydia E Kavraki. The open motion planning library. IEEE Robotics & Automation Magazine, 19(4):72–82, 2012

Showing first 80 references.