Video world models with long-term spatial memory
9 Pith papers cite this work. Polarity classification is still indexing.
[Citation summary: 9 citing papers, all from 2026; verdicts: unverdicted (9); role: background; polarity classification pending.]
citing papers explorer
- Latent State Design for World Models under Sufficiency Constraints
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
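To make the contrast concrete, here is a minimal sketch, assuming a simple encoder with a task head and a dynamics head; the names, shapes, and losses are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch (not the paper's architecture): the latent is
# trained only against sufficiency constraints -- a task head and a
# forward-dynamics head -- with no pixel-reconstruction term, so it is
# free to discard task-irrelevant detail.
class SufficientEncoder(nn.Module):
    def __init__(self, obs_dim=1024, latent_dim=64, task_dim=8):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.task_head = nn.Linear(latent_dim, task_dim)   # e.g. reward/affordances
        self.dynamics = nn.Linear(latent_dim, latent_dim)  # next-latent prediction

    def loss(self, obs, next_obs, task_targets):
        z, z_next = self.encode(obs), self.encode(next_obs)
        task_loss = nn.functional.mse_loss(self.task_head(z), task_targets)
        dyn_loss = nn.functional.mse_loss(self.dynamics(z), z_next.detach())
        return task_loss + dyn_loss  # sufficiency constraints only
```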
- MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
- SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
SWIFT combines a semantic injection cache with head-wise updates and an adaptive dynamic window with segment anchors, enabling efficient multi-prompt long-video generation in causal diffusion models at 22.6 FPS while preserving quality.
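The caching idea can be sketched roughly as follows; this is a hypothetical illustration, not SWIFT's code, and the adaptive window sizing is simplified to a fixed length.

```python
import torch

# Hypothetical sketch: a per-head semantic cache updated head-wise via
# an exponential moving average, combined with a window of recent frames
# and fixed "segment anchor" frames. Names and the update rule are
# assumptions, not SWIFT's implementation.
class SemanticCache:
    def __init__(self, num_heads, dim, window=16, momentum=0.9):
        self.sem = torch.zeros(num_heads, dim)  # one semantic slot per head
        self.window, self.momentum = window, momentum
        self.recent, self.anchors = [], []      # per-frame features

    def update(self, frame_feats, is_anchor=False):
        # frame_feats: (num_heads, dim) summary of the newest frame.
        self.sem = self.momentum * self.sem + (1 - self.momentum) * frame_feats
        self.recent.append(frame_feats)
        if len(self.recent) > self.window:      # adaptive sizing omitted
            self.recent.pop(0)
        if is_anchor:
            self.anchors.append(frame_feats)

    def context(self):
        # Conditioning set for the next step: anchors + recent window + semantics.
        return torch.stack(self.anchors + self.recent + [self.sem])
```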
- CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
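A hedged sketch of what "temporally unaligned training" could look like as a data-sampling rule; the index structure and names are assumptions, not CityRAG's pipeline.

```python
import random

# Sample two clips of the same geo-registered location captured at
# different times: content shared across the pair must belong to the
# fixed scene, while differences (weather, traffic) are transient.
def sample_unaligned_pair(location_index):
    # location_index: {location_id: [clip records]}, each list holding
    # clips of one location from different capture times.
    location = random.choice(list(location_index))
    clip_a, clip_b = random.sample(location_index[location], 2)
    return clip_a, clip_b
```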
- From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exocentric (exo) and egocentric (ego) videos into a single continuous sequence lets sequence diffusion models generate more coherent first-person videos than direct conditioning does, even without pose interpolation.
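As a rough illustration of this conditioning strategy (tensor names and shapes are assumptions, not the paper's code), the exo clip can be placed at the front of the same temporal sequence the model generates, rather than fed through a separate conditioning branch.

```python
import torch

def build_interpolated_sequence(exo_latents, ego_latents):
    # exo_latents: (T_exo, C, H, W) clean context frames;
    # ego_latents: (T_ego, C, H, W) frames the model must generate.
    seq = torch.cat([exo_latents, ego_latents], dim=0)
    # Mask marking which positions are denoised: exo stays fixed,
    # ego is generated as a continuation of the same sequence.
    denoise_mask = torch.cat([
        torch.zeros(exo_latents.shape[0], dtype=torch.bool),
        torch.ones(ego_latents.shape[0], dtype=torch.bool),
    ])
    return seq, denoise_mask
```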
- Lyra 2.0: Explorable Generative 3D Worlds
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
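One plausible reading of "per-frame geometry for information routing" is retrieval of past frames whose geometry overlaps the current view; the sketch below is a hypothetical stand-in, not Lyra's mechanism.

```python
import torch

def route_context(frame_points, memory, radius=0.5, top_k=4):
    # frame_points: (N, 3) points visible in the current view;
    # memory: list of (frame_id, (M, 3) point tensor) for past frames.
    scores = []
    for frame_id, pts in memory:
        # Overlap score: fraction of current points near stored geometry.
        dists = torch.cdist(frame_points, pts)  # (N, M) pairwise distances
        overlap = (dists.min(dim=1).values < radius).float().mean().item()
        scores.append((frame_id, overlap))
    scores.sort(key=lambda s: s[1], reverse=True)
    return [frame_id for frame_id, _ in scores[:top_k]]  # frames to condition on
```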
- Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
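The restore-and-refine process, as described, alternates rendering and restoration; this outline uses placeholder objects (gaussians, diffusion) rather than Rein3D's actual API.

```python
def restore_and_refine(gaussians, diffusion, cameras, num_rounds=3):
    # gaussians: a 3D Gaussian field with render/fit methods (placeholders);
    # diffusion: a panoramic video diffusion model used as a restorer.
    for _ in range(num_rounds):
        renders = [gaussians.render(cam) for cam in cameras]  # possibly degraded views
        restored = diffusion.restore(renders)                 # clean, coherent video
        gaussians.fit(restored, cameras)                      # update the 3D field
    return gaussians
```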
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 uses reinforcement learning with feedback from a 3D model, together with a new text dataset, to improve geometric consistency in text-to-video generation while leaving the base model's weights unchanged.
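Feedback from a 3D model with a frozen base suggests a reward-driven update applied outside the backbone; the sketch below is one hypothetical realization (a REINFORCE-style step on an adapter; every name is a placeholder, not World-R1's API).

```python
import torch

def rl_step(base_model, adapter, prompt, reward_model_3d, optimizer):
    for p in base_model.parameters():
        p.requires_grad_(False)                # base weights stay frozen
    # Hypothetical sampler returning a clip and the log-probability of
    # the choices that produced it (gradients flow only to the adapter).
    video, log_prob = base_model.sample(prompt, adapter=adapter)
    with torch.no_grad():
        reward = reward_model_3d(video)        # geometric-consistency score
    loss = -(reward * log_prob)                # REINFORCE-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```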
- Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and on to autoregressive and multimodal approaches, analyzing the principles and strengths of each family along with future trends.