Mixed citations

Vid2World: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long · 2025 · arXiv 2505.14357

Mixed citation behavior. Most common role is background (67%).

13 Pith papers citing it

Background 67% of classified citations


citation-role summary

background 5 · baseline 1

citation-polarity summary

background 4 · baseline 1 · unclear 1

representative citing papers

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.

Learning Visual Feature-Based World Models via Residual Latent Action

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.

Diffusion Model as a Generalist Segmentation Learner

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.

PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

Co-Evolving Latent Action World Models

cs.LG · 2025-10-30 · unverdicted · novelty 6.0

CoLA-World jointly trains latent action models and world models with a warm-up phase to achieve co-evolution, matching or exceeding prior two-stage methods in video simulation quality and visual planning performance.

Physically Viable World Models: A Case for Query-Conditioned Embodied AI

cs.AI · 2026-05-28 · unverdicted · novelty 5.0

Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.

OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence

cs.RO · 2026-05-12 · unverdicted · novelty 5.0

OrbiSim builds a differentiable physics engine from world models to support gradient-based policy optimization and contact modeling in robotics.

Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.

WorldString: Actionable World Representation

cs.AI · 2026-05-18 · unverdicted · novelty 4.0 · 2 refs

Proposes WorldString, a differentiable neural model for the state manifold of actionable physical objects learned directly from 3D or video data as a building block for world models.

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

cs.CV · 2026-05-15

citing papers explorer


ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models cs.CV · 2026-05-09 · unverdicted · none · ref 10 · 2 links
Learning Visual Feature-Based World Models via Residual Latent Action cs.CV · 2026-05-08 · unverdicted · none · ref 6
MultiWorld: Scalable Multi-Agent Multi-View Video World Models cs.CV · 2026-04-20 · unverdicted · none · ref 15
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 40
PanoWorld: Geometry-Consistent Panoramic Video World Modeling cs.CV · 2026-05-14 · unverdicted · none · ref 11
Diffusion Model as a Generalist Segmentation Learner cs.CV · 2026-04-27 · unverdicted · none · ref 38
PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing cs.CV · 2026-04-08 · unverdicted · none · ref 21
Co-Evolving Latent Action World Models cs.LG · 2025-10-30 · unverdicted · none · ref 18
Physically Viable World Models: A Case for Query-Conditioned Embodied AI cs.AI · 2026-05-28 · unverdicted · none · ref 37
OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence cs.RO · 2026-05-12 · unverdicted · none · ref 19
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models cs.CV · 2026-05-07 · unverdicted · none · ref 25
WorldString: Actionable World Representation cs.AI · 2026-05-18 · unverdicted · none · ref 21 · 2 links
FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization cs.CV · 2026-05-15 · unreviewed · ref 42