Olaf-World: Orienting Latent Actions for Video World Modeling

Ivor W. Tsang; Mike Zheng Shou; Yuchao Gu; Yuxin Jiang

arXiv 2602.10104 v2 · pith:7KFA46BP · submitted 2026-02-10 · cs.CV, cs.AI, cs.LG

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

classification: cs.CV, cs.AI, cs.LG

keywords: action, video, latent, world, across, actions, contexts, control

Original abstract

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
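The abstract describes the alignment objective only in words, so the following is a minimal sketch of one way such an objective could look, not the authors' implementation. It assumes that "integrated latent action" means a sum of per-step latent actions over a clip, that a linear map (`proj`, hypothetical) carries that sum into the encoder's feature space, and that alignment is scored with a cosine loss against the difference between the last-frame and first-frame features. None of these choices is confirmed by the abstract, and all function and variable names are illustrative.

```python
import math


def integrate_actions(latent_actions):
    """Sum per-step latent action vectors over a clip.

    Assumption: 'integration' is a plain sum; the paper may define it differently.
    """
    dim = len(latent_actions[0])
    return [sum(step[i] for step in latent_actions) for i in range(dim)]


def project(matrix, vector):
    """Apply a linear map (given as a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def seq_delta_alignment_loss(latent_actions, feat_first, feat_last, proj):
    """Hypothetical sequence-level loss: 1 - cos(proj(sum of actions), feature delta).

    feat_first / feat_last stand in for frozen-encoder features of the first
    and last frames of the clip; here they are plain lists of floats.
    """
    integrated = integrate_actions(latent_actions)
    predicted_effect = project(proj, integrated)
    observed_effect = [b - a for a, b in zip(feat_first, feat_last)]
    return 1.0 - cosine_similarity(predicted_effect, observed_effect)


# Toy clip: three steps of 2-dimensional latent actions.
latent_actions = [[0.5, 0.0], [0.5, 0.0], [0.0, 1.0]]
# Toy 3-dimensional "encoder features" for the first and last frame.
feat_first = [0.0, 0.0, 0.0]
feat_last = [1.0, 1.0, 0.0]
# Hypothetical 3x2 projection from latent-action space to feature space.
proj = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

loss_aligned = seq_delta_alignment_loss(latent_actions, feat_first, feat_last, proj)
# Reversing the observed effect should push the loss toward its maximum of 2.
loss_opposite = seq_delta_alignment_loss(latent_actions, feat_last, feat_first, proj)
```

In this toy setup the summed actions project exactly onto the observed feature change, so `loss_aligned` is close to zero, while swapping the first and last features gives a loss close to two. Because the target is computed from encoder features rather than from within-clip reconstruction, the same target space is shared across clips, which is the property the abstract attributes to using observable effects as a shared reference.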

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ShadowDancer: Teaching Video World Models Any Action by Learning Unified Dynamics Representations from a Video and Its Shadow
cs.CV · 2026-07 · conditional · novelty 7.0

Cross-shadow prediction on appearance-resampled video pairs yields a unified latent dynamics interface that transfers demonstrated actions across environments better than prior latent-action and interactive world models.

PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning
cs.RO · 2026-06 · unverdicted · novelty 6.0

PoLAR imposes radial structure on latent actions in hyperbolic space to factorize extent and mode, improving robot policy performance over baselines.

Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends
cs.CV · 2026-05 · unverdicted · novelty 2.0

This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.