{"total":18,"items":[{"citing_arxiv_id":"2606.27964","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control","primary_cat":"cs.CV","submitted_at":"2026-06-26T11:08:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A decoupled-control autoregressive video model using Fast-Slow Memory training, dynamic projection, and staged camera control to produce stable long-horizon outputs with human and viewpoint guidance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31336","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory","primary_cat":"cs.CV","submitted_at":"2026-05-29T14:17:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DecMem proposes a decoupled memory system using sparse global and anchored local components to enable consistent minute-long controllable video generation in world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27589","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What-If World: A Causal Benchmark for General World Models in Embodied Scenarios","primary_cat":"cs.CV","submitted_at":"2026-05-26T19:02:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source 
systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26316","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control","primary_cat":"cs.CV","submitted_at":"2026-05-25T20:13:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22814","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration","primary_cat":"cs.LG","submitted_at":"2026-05-21T17:58:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A curiosity-based 3D exploration policy that pairs persistent online 3D reconstruction with episodic sequence modeling over RGB to outperform active-mapping baselines on HM3D and transfer zero-shot to Gibson and synthetic worlds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18365","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T13:17:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated 
via reinforcement fine-tuning to improve geometric coherence in video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15182","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:58:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09442","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-10T09:37:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SWIFT introduces a semantic injection cache with head-wise updates and an adaptive dynamic window plus segment anchors to achieve efficient multi-prompt long video generation at 22.6 FPS while preserving quality in causal diffusion models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2503.20314, 2025. [26] H. Wang, C.-Y . Ma, Y .-C. Liu, J. Hou, T. Xu, J. Wang, F. Juefei-Xu, Y . Luo, P. Zhang, T. Hou, et al. Lingen: Towards high-resolution minute-length text-to-video generation with linear computational complexity. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2578-2588, 2025. [27] T. Wu, S. Yang, R. Po, Y . Xu, Z. Liu, D. 
Lin, and G. Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025. [28] Z. Wu, A. Siarohin, W. Menapace, I. Skorokhodov, Y. Fang, V. Chordia, I. Gilitschenski, and S. Tulyakov. Mind the time: Temporally-controlled multi-event video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23989-24000,"},{"citing_arxiv_id":"2605.01694","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Latent State Design for World Models under Sufficiency Constraints","primary_cat":"cs.AI","submitted_at":"2026-05-03T03:19:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[63] Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, and Baorui Ma. Chain of world: World model thinking in latent motion, 2026. URL https://arxiv.org/abs/2603.03195. [64] Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators, 2023. URL https://arxiv.org/abs/2310.06114. [65] Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos, 2024. URL https://arxiv.org/abs/2410.11758. 
[66] Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao."},{"citing_arxiv_id":"2604.24764","ref_index":9,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World-R1: Reinforcing 3D Constraints for Text-to-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-27T17:59:56+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Steerx: Creating any camera-free 3d and 4d scenes with geometric steering. In Int. Conf. Comput. Vis., pages 27326-27337, 2025. [8] Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson WH Lau, Wangmeng Zuo, and Chunchao Guo. Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. arXiv preprint arXiv:2506.04225, 2025. [9] Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025. [10] Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, and Jiang Bian. Geometry forcing: Marrying video diffusion and 3d representation for consistent world"},{"citing_arxiv_id":"2604.19741","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CityRAG: Stepping Into a City via Spatially-Grounded Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-21T17:59:03+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"finetuned based on the requirements of downstream applications. Our application requires long-term consistency, pose control, and integration of external context. 
Long-term consistency. Works in long-context or autoregressive generation [7,8,27,44,52,65,70] maintain consistency by balancing computational efficiency and storing past samples. Another line of work creates an explicit memory like point clouds [22,46,64,68]. However, these works rarely show the capacity to generate minutes-long videos without significant degradation, and have an orthogonal focus to our work. CityRAG retrieves external context, rather than past samples, to maintain consistency. Pose-conditioning. Pose-conditioned models [2,23,46,54,56,60,74] finetune a base generative model on camera poses, often in the form of camera parameters"},{"citing_arxiv_id":"2604.18564","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiWorld: Scalable Multi-Agent Multi-View Video World Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T17:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Keywords: Video World Models · Multi-Agent Systems · Multi-View Consistency 1 Introduction Video world models [15,19,26,41,50,54,57] have achieved significant success in accurately predicting future environment dynamics conditioned on text or actions. 
However, existing video world models implicitly assume a single agent in the simulated environment, ignoring interactions and interdependencies among multiple agents acting simultaneously, as in collaborative robotics and multi-player"},{"citing_arxiv_id":"2604.13793","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation","primary_cat":"cs.CV","submitted_at":"2026-04-15T12:32:25+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Master [2] achieve fine-grained camera control by directly manipulating temporal attention within large video generators. World models enable interactive environments with real-time camera control: WorldPlay [43], Genie 3 [37], and LingBot [44]. Long-horizon video generators, such as Diffusion Forcing Transformers [42] and MotionStream [41] have been demonstrated to follow camera controls as well. SPMem [48], VMem [24], Gen3C [39], and TrajectoryCrafter [52] maintain 3D representations during video generation, to produce consistent content throughout viewpoints. These camera-controlled video generation methods synthesize videos by following a continuous camera motion trajectory during generation. 
However, they do not address the task of transforming the viewpoint of an existing video, leav-"},{"citing_arxiv_id":"2604.13036","ref_index":117,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lyra 2.0: Explorable Generative 3D Worlds","primary_cat":"cs.CV","submitted_at":"2026-04-14T17:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"535 58.96 24.60 76.7769.54 0.0680.350 0.589 79.07 21.75 75.5470.91 0.054 Yume1.5 [70] 0.342 0.719 84.84 22.80 66.73 - 0.095 0.348 0.702 89.69 28.68 78.63 - 0.083 CaM [128] 0.370 0.562 50.43 35.19 82.63 42.71 0.069 0.367 0.605 59.20 34.22 82.83 31.86 0.056 VMem [51] 0.331 0.744 120.59 18.54 76.14 0.68 0.268 0.338 0.767 136.48 16.21 70.54 0.00 0.263 SPMem [117] 0.3830.522 53.77 38.32 82.79 62.05 0.074 0.383 0.571 60.11 34.41 79.68 45.07 0.059 HY-WorldPlay [37] 0.373 0.765 139.36 4.79 54.62 - 0.092 0.380 0.796 163.54 3.24 48.22 - 0.084 Ours 0.388 0.498 43.4344.54 87.46 64.67 0.0760.3840.552 51.33 43.35 85.0763.87 0.069 Ours DMD0.359 0.507 43.63 45.21 88.5765.64 0.088 0.3620.545 49.7143.02 78.91 58.12 0.077"},{"citing_arxiv_id":"2604.10578","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-04-12T10:55:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via 
diffusion models and then used to update a 3D Gaussian field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"dress this by leveraging strong 2D visual priors. However, standard image-based methods [23,56,60-62] often suffer from accumulated geometric errors. While strategies like explicit constraints or multi-view synthesis [7,40,66] alleviate this issue, they remain computationally intensive and operationally cumbersome. In contrast, video diffusion methods [3,11,51], especially Video-to-Video (V2V) approaches guided by point cloud priors, have demonstrated impressive results. Nevertheless, these methods typically operate with a limited Field-of-View (FoV). Covering an entire scene requires stitching many views along carefully designed camera trajectories, which is computationally expensive and often yields weaker global consistency than panoramic representations."},{"citing_arxiv_id":"2604.06339","ref_index":288,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evolution of Video Generative Foundations","primary_cat":"cs.CV","submitted_at":"2026-04-07T18:17:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"wave of hybrid models integrates \"3D Memory\" directly into the video diffusion process, combining dynamic expressiveness with structural rigidity. One dominant approach employs explicit geometric structures like point clouds or TSDF volumes. 
While early works like ViewCrafter [285] relied on static priors, advanced systems such as Spatia [286], VMem [287], SPMem [288], and EvoWorld [289] construct globally updatable memory banks that are projected back into the generation loop, ensuring revisited locations retain their structure. WorldWarp [290] further explores 3D Gaussian Splatting (3DGS) for representation, though optimization latency remains a bottleneck. Conversely, to avoid the computational cost of explicit reconstruction, implicit"},{"citing_arxiv_id":"2603.11911","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model","primary_cat":"cs.CV","submitted_at":"2026-03-12T13:28:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretrained diffusion models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.07982","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling","primary_cat":"cs.CV","submitted_at":"2025-07-10T17:55:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}