hub Mixed citations

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Mixed citation behavior. Most common role is background (69%).

69 Pith papers citing it
Background 69% of classified citations
abstract

A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.

hub tools

citation-role summary

background 18 baseline 4 method 3 dataset 1

claims ledger

  • background generative modeling in pixel space, while transformer-based architectures like ACT [50] and Perceiver-Actor [36] leverage spatiotemporal attention for long-horizon manipulation. In parallel, recent 2D VLA models further extend visuomotor learning to multimodal settings. Systems such as OpenVLA [21], OpenVLA-OFT [20], π0 [3], RT-2 [56], RT-X [29], RoboFlamingo [25], Octo [38], GR-1 [41], and UniVLA [4] integrate large vision-language backbones with robot action policies, enabling semantic gro
  • method a latent space 𝑍 ∈ ℝ^(𝑇×𝐶×𝐻×𝑊), where 𝑇, 𝐶, 𝐻, and 𝑊 denote the number of frames, channels, height, and width, respectively. Unlike video/image VAEs whose primary goal is compression, our approach targets semantic understanding by leveraging a self-supervised backbone with rich high-level representations. Specifically, we adopt DINOv2 [39], which is trained with both contrastive learning [8] and masked image modeling [57]. Causally-Constrained Framewise Autoregression. Videos possess an intrinsic temporal
  • baseline Table 1: Benchmark comparison on multiple embodied manipulation tasks. CALVIN denotes "ABCD→D" and CALVIN∗ denotes "ABC→D"; LIBERO-plus∗ denotes finetuning with the LIBERO-plus dataset.
      Model           Size  LIBERO  LIBERO-plus  LIBERO-plus∗  RoboCasa-50  GR1   CALVIN  CALVIN∗  Robotwin2
      # VLA
      π0 [4]          3B    94.4    53.6         -             42.4         -     -       3.92     65.9/58.4
      π0-FAST [111]   3B    85.5    61.6         -             -            -     -       -        -
      X-VLA [112]     0.9B  -       -            -             -            -     4.43    -        72.9/72.8
      UniVLA [87]     8B    95.5    -            -             -            -     4.63    4.41     -
      gr00t-N1.6 [5]  3B    93.9    -            -             36.0         47.6  4.60    4.24     -
      π0.5 [40]       3B    96
  • baseline actionless videos using latent actions. RLA achieves the highest average success rate and rank. Success rates are evaluated over 50 episodes (seeds 42-91) and averaged over the last five checkpoints.
      Method          PushT  Roll  Pull  Pull Tool  Poke  Rank ↓  Avg SR ↑
      BC-ResNet       3.6    42.0  33.6  7.6        49.2  3.8     27.2
      DINO CLS [39]   7.6    39.6  40.4  4.4        44.8  4.0     27.4
      UniVLA [44]     6.0    37.6  42.8  7.2        50.0  3.8     28.7
      AdaWorld [24]   9.2    38.4  48.4  10.8       61.6  2.2     33.7
      RLA (Ours)      15.2   43.8  43.6  12.0       63.6  1.2     35.6
      4.2 Minimalist
  • background The second paradigm treats humans as an alternative embodiment, either by jointly training on human and robot data or by aligning behaviors in a shared latent action space [6,35,72,82]. Despite reducing some representation gaps, these methods still face a substantial cross-embodiment gap. The third paradigm leverages human data for general visual representation or predictive world-model pretraining [11,46,73]. However, these methods mainly focus on what actions are executed, they largely overlook why a parti
  • background by explicitly validating the dataset through a large-scale, reproducible study designed to ensure robot-learning readiness. Robot Learning from Human Data. Human data presents two main opportunities for robot learning: abundant unlabeled online videos and curated, labeled demonstrations [1, 29, 46, 52]. Web videos, though plentiful, require pseudo-labeling of actions via inverse dynamics models [6, 13, 55], affordances [2, 44], or point tracking [4, 42, 50] for policy training, forming a basi

representative citing papers

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.

DiLA: Disentangled Latent Action World Models

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

DiLA uses content-structure disentanglement driven by predictive bottlenecks to create semantically structured latent actions for high-fidelity video world models.

CUBic: Coordinated Unified Bimanual Perception and Control Framework

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordination accuracy and task success on the RoboTwin benchmark.

Why Latent Actions Fail, and How to Prevent It

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Extending linear LAMs to model exogenous state shows standard reconstruction encodes future exogenous info in latent actions, while endogenous-focused spaces and auxiliary objectives like action-supervision enforce consistency across noise.

citing papers explorer

Showing 50 of 69 citing papers.