hub Mixed citations

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo · 2025 · cs.RO · arXiv 2507.23682

Mixed citation behavior. Most common role is background (62%).

33 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 33 citing papers arXiv PDF

abstract

Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of latent actions, abstract representations of motion between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. We demonstrate that villa-X can generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding. This capability enables villa-X to achieve superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving both gripper and dexterous hand manipulation. These results establish villa-X as a principled and scalable paradigm for learning generalizable robot manipulation policies. We believe it provides a strong foundation for future research.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 other 2 baseline 1

citation-polarity summary

background 8 unclear 4 baseline 1

representative citing papers

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

RotVLA: Rotational Latent Action for Vision-Language-Action Model

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

Learning Additively Compositional Latent Actions for Embodied AI

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

AC-LAM enforces additive composition on latent actions from visual transitions, yielding more structured and calibrated motion latents that improve downstream embodied policy learning over prior LAMs.

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

cs.RO · 2026-02-23 · unverdicted · novelty 7.0

UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

Imitation from Heterogeneous Demonstrations using Grounded Latent-Action World Models

cs.RO · 2026-06-19 · unverdicted · novelty 6.0

GLAM learns a shared latent action space grounded in consistent future observation prediction across heterogeneous data sources to train improved behavioral cloning policies for robot manipulation tasks.

PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

cs.RO · 2026-06-19 · unverdicted · novelty 6.0

PoLAR imposes radial structure on latent actions in hyperbolic space to factorize extent and mode, improving robot policy performance over baselines.

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

cs.CV · 2026-06-17 · unverdicted · novelty 6.0 · 2 refs

A Hybrid Disentangled VQ-VAE with physical masks creates a cross-embodiment action codebook from human videos, allowing VLA pre-training that adapts to new embodiments with only 50 trajectories.

EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

cs.RO · 2026-06-16 · unverdicted · novelty 6.0

EgoInfinity is a modular pipeline that lifts in-the-wild RGB videos into agent-agnostic 4D hand-object data with interaction-aware refinement and retargets motions to diverse robot morphologies for video-to-action learning.

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

cs.CV · 2026-06-05 · unverdicted · novelty 6.0 · 2 refs

LARA jointly optimizes LAM and VLA models via representation alignment to improve robotic manipulation performance using human videos.

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.

SCAR: Self-Supervised Continuous Action Representation Learning

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.

CUBic: Coordinated Unified Bimanual Perception and Control Framework

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordination accuracy and task success on the RoboTwin benchmark.

Reinforcing VLAs in Task-Agnostic World Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.

Unified Noise Steering for Efficient Human-Guided VLA Adaptation

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

cs.RO · 2026-05-08 · unverdicted · novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discrete tokens proving most effective.

GazeVLA: Learning Human Intention for Robotic Manipulation

cs.RO · 2026-04-24 · unverdicted · novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

cs.RO · 2026-04-21 · unverdicted · novelty 6.0

UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

cs.RO · 2026-04-14 · unverdicted · novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

cs.RO · 2026-04-06 · unverdicted · novelty 6.0

Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

citing papers explorer

Showing 10 of 10 citing papers after filters.

Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 84 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 17 · internal anchor
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
Unified Noise Steering for Efficient Human-Guided VLA Adaptation cs.RO · 2026-05-11 · unverdicted · none · ref 8 · internal anchor
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models cs.RO · 2026-05-06 · unverdicted · none · ref 15 · internal anchor
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discrete tokens proving most effective.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling cs.RO · 2026-04-21 · unverdicted · none · ref 23 · internal anchor
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation cs.RO · 2026-01-11 · unverdicted · none · ref 18 · internal anchor
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey cs.RO · 2025-08-18 · unverdicted · none · ref 142 · internal anchor
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
Towards Robotic Dexterous Hand Intelligence: A Survey cs.RO · 2026-05-13 · unverdicted · none · ref 114 · internal anchor
A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 89 · internal anchor
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
RLDX-1 Technical Report cs.RO · 2026-05-05 · unverdicted · none · ref 25 · 2 links · internal anchor
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer