EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory
6 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (6)
roles: background (3)
polarities: background (3)
representative citing papers
E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
Sensor2Sensor uses 4D Gaussian Splatting to create synthetic training pairs and a diffusion model to convert monocular dashcam videos into high-fidelity multi-modal AV sensor data.
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
citing papers explorer
- MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold
  MoVerse generates real-time interactive video world models from single narrow-FOV images via panoramic diffusion expansion, Gaussian scaffold lifting, and distillation of a bidirectional diffusion teacher into a causal autoregressive renderer.
- E³C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
  E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
- Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
  Sensor2Sensor uses 4D Gaussian Splatting to create synthetic training pairs and a diffusion model to convert monocular dashcam videos into high-fidelity multi-modal AV sensor data.
- Evolution of Video Generative Foundations
  This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models