OTF decomposes transitions into reusable primitives to form action-like latents in OTF-LAM and OTF-LAM-Dino, enabling zero-shot transfer and competitive policy learning under visual ambiguity.
Learning latent action world models in the wild
Mixed citation behavior. Most common role is background (60%).
citation-role summary
citation-polarity summary
years: 2026 (12)
verdicts: UNVERDICTED (12)
roles: background (5)
representative citing papers
NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
CLAW is an end-to-end self-supervised method that learns semantically meaningful continuous latent actions and predictive world models from action-free videos to support imitation learning and goal-directed planning.
DiLA uses content-structure disentanglement driven by predictive bottlenecks to create semantically structured latent actions for high-fidelity video world models.
SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.
Extending linear LAMs to model exogenous state shows that standard reconstruction encodes future exogenous information in latent actions, while endogenous-focused spaces and auxiliary objectives such as action supervision enforce consistency across noise.
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
VAE world model trained on embodied exploration develops latent representations aligned with physical geometry, with metrics improving together and collapsing together under high KL regularization.
PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.
citing papers explorer
- Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision
VAE world model trained on embodied exploration develops latent representations aligned with physical geometry, with metrics improving together and collapsing together under high KL regularization.