Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.
Mixed citations
Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357,
Mixed citation behavior. Most common role is background (67%).
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 18representative citing papers
ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
RoboWorld introduces an automated pipeline using autoregressive video world models and task-progress VLM scoring, plus Step Forcing for long-horizon stability, to achieve high correlation with real robot policy evaluation.
ADWM learns a latent diffusion world model with per-transition independent denoising and policy-conditioned guidance to enable accurate offline evaluation of LLM agent policies.
GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.
Introduces mesh tokenization to condition DiT-based video diffusion models directly on 3D human meshes for motion control without 2D rendering.
FashionChameleon achieves interactive multi-garment video customization at 23.8 FPS via in-context teacher models, streaming distillation, and training-free KV cache rescheduling while using only single-garment data.
PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
CoLA-World jointly trains latent action models and world models with a warm-up phase to achieve co-evolution, matching or exceeding prior two-stage methods in video simulation quality and visual planning performance.
Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.
OrbiSim builds a differentiable physics engine from world models to support gradient-based policy optimization and contact modeling in robotics.
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
Proposes WorldString, a differentiable neural model for the state manifold of actionable physical objects learned directly from 3D or video data as a building block for world models.
citing papers explorer
-
Physically Viable World Models: A Case for Query-Conditioned Embodied AI
Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.
-
WorldString: Actionable World Representation
Proposes WorldString, a differentiable neural model for the state manifold of actionable physical objects learned directly from 3D or video data as a building block for world models.