Mixed citations

Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long · 2025 · arXiv 2505.14357

Mixed citation behavior. The most common role is background (67% of classified citations).

18 Pith papers cite it.

citation-role summary

background: 5 · baseline: 1

citation-polarity summary

background: 4 · baseline: 1 · unclear: 1

representative citing papers

World Models as Group Actions

cs.CV · 2026-05-23 · unverdicted · novelty 7.0

Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.

Learning Visual Feature-Based World Models via Residual Latent Action

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

RoboWorld: Fast and Reliable Neural Simulators for Generalist Robot Policy Evaluation

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

RoboWorld introduces an automated pipeline using autoregressive video world models and task-progress VLM scoring, plus Step Forcing for long-horizon stability, to achieve high correlation with real robot policy evaluation.

Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

ADWM learns a latent diffusion world model with per-transition independent denoising and policy-conditioned guidance to enable accurate offline evaluation of LLM agent policies.

Geometry-Aware Implicit Memory for Video World Models

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Introduces mesh tokenization to condition DiT-based video diffusion models directly on 3D human meshes for motion control without 2D rendering.

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

FashionChameleon achieves interactive multi-garment video customization at 23.8 FPS via in-context teacher models, streaming distillation, and training-free KV cache rescheduling while using only single-garment data.

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.

Diffusion Model as a Generalist Segmentation Learner

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.

PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

Co-Evolving Latent Action World Models

cs.LG · 2025-10-30 · unverdicted · novelty 6.0

CoLA-World jointly trains latent action models and world models with a warm-up phase to achieve co-evolution, matching or exceeding prior two-stage methods in video simulation quality and visual planning performance.

Physically Viable World Models: A Case for Query-Conditioned Embodied AI

cs.AI · 2026-05-28 · unverdicted · novelty 5.0

Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.

OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence

cs.RO · 2026-05-12 · unverdicted · novelty 5.0

OrbiSim builds a differentiable physics engine from world models to support gradient-based policy optimization and contact modeling in robotics.

Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.

WorldString: Actionable World Representation

cs.AI · 2026-05-18 · unverdicted · novelty 4.0 · 2 refs

Proposes WorldString, a differentiable neural model for the state manifold of actionable physical objects learned directly from 3D or video data as a building block for world models.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Physically Viable World Models: A Case for Query-Conditioned Embodied AI

cs.AI · 2026-05-28 · unverdicted · none · ref 37

Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.

WorldString: Actionable World Representation

cs.AI · 2026-05-18 · unverdicted · none · ref 21 · 2 links

Proposes WorldString, a differentiable neural model for the state manifold of actionable physical objects learned directly from 3D or video data as a building block for world models.
