MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

2026 · cs.RO · arXiv 2602.03668

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Latent actions learned from diverse human videos serve as pseudo-labels for vision-language-action (VLA) pretraining, but provide effective supervision only if they remain informative about the underlying ground-truth actions. For effective supervision, latent actions should contain information about the underlying actions even though they are inaccessible. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns latent actions that are highly informative about ground-truth actions from multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on various benchmarks. The code and trained checkpoints are available at https://jmsnu.github.io.
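The abstract does not spell out the architecture or the exact loss, so the following is only a toy sketch of what a cross-viewpoint reconstruction objective could look like. Everything in it is an assumption made for illustration: the names `cross_view_loss`, `toy_encode`, and `toy_decode` are hypothetical, observations are plain vectors rather than images, and the encoder and decoder are hand-written stand-ins instead of learned networks. The point is only the structure: a latent action inferred from one view is scored on how well it predicts the next observation in the other view.

```python
from typing import Callable, List, Sequence, Tuple

Vec = Sequence[float]
Transition = Tuple[Vec, Vec]  # (observation at t, observation at t+1) in one view


def mse(pred: Vec, target: Vec) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def cross_view_loss(
    encode: Callable[[Vec, Vec], List[float]],
    decode: Callable[[Vec, Vec], List[float]],
    view_a: Transition,
    view_b: Transition,
) -> Tuple[float, float]:
    """Return (self-view term, cross-view term) for one synchronized pair of views.

    Hypothetical formulation, not the paper's stated loss.
    """
    (a_t, a_next), (b_t, b_next) = view_a, view_b
    z_a = encode(a_t, a_next)  # latent action inferred from view A
    z_b = encode(b_t, b_next)  # latent action inferred from view B
    # Self-view: each latent reconstructs the future of its own view.
    self_term = mse(decode(a_t, z_a), a_next) + mse(decode(b_t, z_b), b_next)
    # Cross-view: each latent must also explain the future of the other view.
    cross_term = mse(decode(b_t, z_a), b_next) + mse(decode(a_t, z_b), a_next)
    return self_term, cross_term


def toy_encode(o_t: Vec, o_next: Vec) -> List[float]:
    """Stand-in encoder: the latent action is the observation difference."""
    return [n - c for c, n in zip(o_t, o_next)]


def toy_decode(o_t: Vec, z: Vec) -> List[float]:
    """Stand-in decoder: add the latent action to the current observation."""
    return [c + d for c, d in zip(o_t, z)]


# Same underlying motion seen from two viewpoints (different offsets only).
shared = cross_view_loss(
    toy_encode, toy_decode,
    ([0.0, 0.0], [1.0, 0.0]),
    ([5.0, 5.0], [6.0, 5.0]),
)

# View B additionally changes in a viewpoint-specific way (second coordinate).
view_specific = cross_view_loss(
    toy_encode, toy_decode,
    ([0.0, 0.0], [1.0, 0.0]),
    ([5.0, 5.0], [6.0, 6.0]),
)
```

In this toy setting the self-view term is zero in both cases, because the difference encoder can always reconstruct its own view. Only the cross-view term separates them: it is zero when both views share the same change and positive when view B carries a viewpoint-specific change, which is the kind of cue the abstract says the objective is meant to discourage.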

representative citing papers

Why Latent Actions Fail, and How to Prevent It

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Extending linear LAMs to model exogenous state shows standard reconstruction encodes future exogenous info in latent actions, while endogenous-focused spaces and auxiliary objectives like action-supervision enforce consistency across noise.

citing papers explorer

Showing 1 of 1 citing paper.

Why Latent Actions Fail, and How to Prevent It cs.CV · 2026-05-13 · unverdicted · none · ref 5 · internal anchor
