Video world models with long-term spatial memory

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein · 2025 · arXiv 2506.05284

Canonical reference. 29 Pith papers cite this work; 8 of the 9 classified citations (89%) cite it as background.

citation-role summary

background: 8 · baseline: 1

citation-polarity summary

background: 8 · baseline: 1
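
The 89% background share quoted near the top of this page is consistent with the role counts listed here, assuming the percentage is taken over the classified citations only (8 background plus 1 baseline) and not over all 29 citing papers. A minimal Python sketch of that arithmetic follows; the dictionary and the function name are illustrative and not part of any Pith API.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 8, "baseline": 1}


def role_share(counts: dict[str, int], role: str) -> float:
    """Return the percentage of classified citations assigned to `role`."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classified citations")
    return 100.0 * counts[role] / total


background_pct = role_share(role_counts, "background")
total_classified = sum(role_counts.values())
print(f"background: {background_pct:.0f}% of {total_classified} classified citations")
# prints "background: 89% of 9 classified citations"
```

Under that assumption, the remaining citing papers are simply unclassified and do not enter the percentage.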

representative citing papers

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.

Latent State Design for World Models under Sufficiency Constraints

cs.AI · 2026-05-03 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

A generative video model conditioned on pixel-aligned 3D renderings produces consistent dynamic 3D Gaussian splats from monocular video and sets new SOTA in 4D reconstruction.

Pano2World: End-to-End 3D Generation via Unified Multi-View Sequences

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

Pano2World generates an explorable 3D Gaussian scene directly from a single indoor panorama via coarse proxy rendering, view-aware joint denoising, and a latent feature adapter.

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.

Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving

cs.CV · 2026-06-09 · unverdicted · novelty 6.0

Envision4D presents a feed-forward 4D Gaussian Splatting framework with future pose prediction, temporal attention, and conditioned motion lifting for pose-free extrapolation in autonomous driving scenes.

Latent Spatial Memory for Video World Models

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

Mirage stores and queries 3D scene information in diffusion latent space via depth-guided lifting and warping, yielding 10.57× faster generation and 55× smaller memory than explicit RGB point-cloud baselines while reaching SOTA on WorldScore.

Echo-Memory: A Controlled Study of Memory in Action World Models

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.

Geometry-Aware Implicit Memory for Video World Models

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.

Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.

E³C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

A curiosity-based 3D exploration policy that pairs persistent online 3D reconstruction with episodic sequence modeling over RGB to outperform active-mapping baselines on HM3D and transfer zero-shot to Gibson and synthetic worlds.

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in causal diffusion models.

Lyra 2.0: Explorable Generative 3D Worlds

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

cs.CV · 2025-07-10 · unverdicted · novelty 6.0

Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control

cs.CV · 2026-06-26 · unverdicted · novelty 5.0

A decoupled-control autoregressive video model using Fast-Slow Memory training, dynamic projection, and staged camera control to produce stable long-horizon outputs with human and viewpoint guidance.

WorldOlympiad: Can Your World Model Survive a Triathlon?

cs.CV · 2026-06-09 · unverdicted · novelty 5.0

WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.

DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

DecMem proposes a decoupled memory system using sparse global and anchored local components to enable consistent minute-long controllable video generation in world models.

citing papers explorer

Showing 27 of 27 citing papers after filters.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · none · ref 56

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · none · ref 69

What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

cs.CV · 2026-04-20 · unverdicted · none · ref 54

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

cs.CV · 2026-07-01 · unverdicted · none · ref 59

A generative video model conditioned on pixel-aligned 3D renderings produces consistent dynamic 3D Gaussian splats from monocular video and sets new SOTA in 4D reconstruction.

Pano2World: End-to-End 3D Generation via Unified Multi-View Sequences

cs.CV · 2026-07-01 · unverdicted · none · ref 19

Pano2World generates an explorable 3D Gaussian scene directly from a single indoor panorama via coarse proxy rendering, view-aware joint denoising, and a latent feature adapter.

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

cs.CV · 2026-06-15 · unverdicted · none · ref 4

PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.

Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving

cs.CV · 2026-06-09 · unverdicted · none · ref 44

Envision4D presents a feed-forward 4D Gaussian Splatting framework with future pose prediction, temporal attention, and conditioned motion lifting for pose-free extrapolation in autonomous driving scenes.

Latent Spatial Memory for Video World Models

cs.CV · 2026-06-08 · unverdicted · none · ref 18

Mirage stores and queries 3D scene information in diffusion latent space via depth-guided lifting and warping, yielding 10.57× faster generation and 55× smaller memory than explicit RGB point-cloud baselines while reaching SOTA on WorldScore.

Echo-Memory: A Controlled Study of Memory in Action World Models

cs.CV · 2026-06-08 · unverdicted · none · ref 57

A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

cs.CV · 2026-06-01 · unverdicted · none · ref 31

MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.

Geometry-Aware Implicit Memory for Video World Models

cs.CV · 2026-06-01 · unverdicted · none · ref 55

GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.

Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation

cs.CV · 2026-05-29 · unverdicted · none · ref 63

Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.

E³C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

cs.CV · 2026-05-25 · unverdicted · none · ref 64

E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

cs.CV · 2026-05-18 · unverdicted · none · ref 78

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

cs.CV · 2026-05-14 · unverdicted · none · ref 10

Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

cs.CV · 2026-05-10 · unverdicted · none · ref 27

SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in causal diffusion models.

Lyra 2.0: Explorable Generative 3D Worlds

cs.CV · 2026-04-14 · unverdicted · none · ref 117

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

cs.CV · 2026-04-12 · unverdicted · none · ref 51

Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

cs.CV · 2025-07-10 · unverdicted · none · ref 79

Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control

cs.CV · 2026-06-26 · unverdicted · none · ref 61

A decoupled-control autoregressive video model using Fast-Slow Memory training, dynamic projection, and staged camera control to produce stable long-horizon outputs with human and viewpoint guidance.

WorldOlympiad: Can Your World Model Survive a Triathlon?

cs.CV · 2026-06-09 · unverdicted · none · ref 42

WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.

DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

cs.CV · 2026-05-29 · unverdicted · none · ref 41

DecMem proposes a decoupled memory system using sparse global and anchored local components to enable consistent minute-long controllable video generation in world models.

InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

cs.CV · 2026-03-12 · unverdicted · none · ref 38

InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretrained diffusion models.

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

cs.CV · 2026-04-27 · unverdicted · none · ref 9 · 3 links

World-R1 applies reinforcement learning via Flow-GRPO and a text dataset to align text-to-video models with 3D constraints from pre-trained foundation models, improving consistency while keeping original visual quality.

Evolution of Video Generative Foundations

cs.CV · 2026-04-07 · unverdicted · none · ref 288

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

cs.CV · 2026-04-21 · unreviewed · ref 64

From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

cs.CV · 2026-04-15 · unreviewed · ref 48
