Video world models with long-term spatial memory
Canonical reference. 89% of citing Pith papers cite this work as background.
representative citing papers
What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
A generative video model conditioned on pixel-aligned 3D renderings produces consistent dynamic 3D Gaussian splats from monocular video and sets new SOTA in 4D reconstruction.
Pano2World generates an explorable 3D Gaussian scene directly from a single indoor panorama via coarse proxy rendering, view-aware joint denoising, and a latent feature adapter.
PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.
Envision4D presents a feed-forward 4D Gaussian Splatting framework with future pose prediction, temporal attention, and conditioned motion lifting for pose-free extrapolation in autonomous driving scenes.
Mirage stores and queries 3D scene information in diffusion latent space via depth-guided lifting and warping, yielding 10.57× faster generation and 55× smaller memory than explicit RGB point-cloud baselines while reaching SOTA on WorldScore.
A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.
MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.
GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.
Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.
E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
A curiosity-based 3D exploration policy that pairs persistent online 3D reconstruction with episodic sequence modeling over RGB to outperform active-mapping baselines on HM3D and transfer zero-shot to Gibson and synthetic worlds.
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in causal diffusion models.
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
A decoupled-control autoregressive video model using Fast-Slow Memory training, dynamic projection, and staged camera control to produce stable long-horizon outputs with human and viewpoint guidance.
WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.
DecMem proposes a decoupled memory system using sparse global and anchored local components to enable consistent minute-long controllable video generation in world models.
citing papers explorer
- MemLearner: Learning to Query Context memory for Video World Models
MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.
- What-If World: A Causal Benchmark for General World Models in Embodied Scenarios
What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.
- MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
- World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video
A generative video model conditioned on pixel-aligned 3D renderings produces consistent dynamic 3D Gaussian splats from monocular video and sets new SOTA in 4D reconstruction.
- Pano2World: End-to-End 3D Generation via Unified Multi-View Sequences
Pano2World generates an explorable 3D Gaussian scene directly from a single indoor panorama via coarse proxy rendering, view-aware joint denoising, and a latent feature adapter.
- PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory
PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.
- Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving
Envision4D presents a feed-forward 4D Gaussian Splatting framework with future pose prediction, temporal attention, and conditioned motion lifting for pose-free extrapolation in autonomous driving scenes.
- Latent Spatial Memory for Video World Models
Mirage stores and queries 3D scene information in diffusion latent space via depth-guided lifting and warping, yielding 10.57× faster generation and 55× smaller memory than explicit RGB point-cloud baselines while reaching SOTA on WorldScore.
- Echo-Memory: A Controlled Study of Memory in Action World Models
A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.
- MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data
MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.
- Geometry-Aware Implicit Memory for Video World Models
GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.
- Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation
Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.
- E³C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
- GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
- Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
- SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in causal diffusion models.
- Lyra 2.0: Explorable Generative 3D Worlds
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
- Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
- Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control
A decoupled-control autoregressive video model using Fast-Slow Memory training, dynamic projection, and staged camera control to produce stable long-horizon outputs with human and viewpoint guidance.
- WorldOlympiad: Can Your World Model Survive a Triathlon?
WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.
- DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory
DecMem proposes a decoupled memory system using sparse global and anchored local components to enable consistent minute-long controllable video generation in world models.
- InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model
InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretrained diffusion models.
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 applies reinforcement learning via Flow-GRPO and a text dataset to align text-to-video models with 3D constraints from pre-trained foundation models, improving consistency while keeping original visual quality.
- Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
- CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
- From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation