Latent action learning requires supervision in the presence of distractors
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
roles
background 1
polarities
background 1
representative citing papers
villa-X enhances latent action modeling in VLA models to support zero-shot action planning for unseen robot embodiments and open-vocabulary instructions, yielding better manipulation results in simulation and real-world tests.
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
citing papers explorer
- From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discrete tokens proving most effective.
- villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
villa-X enhances latent action modeling in VLA models to support zero-shot action planning for unseen robot embodiments and open-vocabulary instructions, yielding better manipulation results in simulation and real-world tests.
- Motus: A Unified Latent Action World Model
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.