{"total":17,"items":[{"citing_arxiv_id":"2606.29020","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Semantic-Aware, Physics-Informed, Geometry-Grounded Weather Video Synthesis","primary_cat":"cs.CV","submitted_at":"2026-06-27T17:38:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new framework factorizes weather video synthesis into semantic appearance anchoring, physics-informed Gaussian particle simulation under gravity/wind/turbulence, and geometry-grounded alignment to produce diverse realistic weather effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29509","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KGEdit: Ambiguity-Aware Knowledge Graphs for Training-Free Precise Video Generation and Editing","primary_cat":"cs.CV","submitted_at":"2026-05-28T07:31:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"KGEdit uses an ambiguity-aware knowledge graph and structured injection modules to improve semantic control and temporal consistency in training-free text-to-video diffusion models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23602","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes","primary_cat":"cs.CV","submitted_at":"2026-05-22T13:11:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15256","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReactiveGWM: Steering NPC in Reactive Game World Models","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:52:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15116","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DriveCtrl: Conditioned Sim-to-Real Driving Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:29:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DriveCtrl is a depth-conditioned controllable framework that generates realistic driving videos from simulation while preserving annotations and scene dynamics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02586","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fewer, Better Frames: A Compute-Normalized Proof of Concept for Coherence-First World-Model Rendering with Model-Guided FSR4 Frame Generation","primary_cat":"cs.GR","submitted_at":"2026-05-11T16:42:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Coherence-first rendering with 15 FPS anchors plus FSR4 upsampling to 30 FPS preserves scene geometry and identity longer than native 30 FPS generation across tested forest, sword, desert, and snow scenes, 
with LPIPS favoring the coherence branch.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10543","ref_index":6,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TIE: Time Interval Encoding for Video Generation over Events","primary_cat":"cs.CV","submitted_at":"2026-05-11T13:23:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TIE derives a sinc-based interval encoding from Temporal Integrability and Duration Invariance principles, raising human-verified temporal constraint satisfaction from 77.34% to 96.03% while preserving visual quality in DiT models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"annotations are often automatic and imperfect, and reaches comparable visual and temporal-control quality in fewer training steps. The code and dataset are available at https://github.com/MatrixTeam-AI/TIE. 1 Introduction The landscape of video generation has undergone a paradigm shift, with Diffusion Transformers (DiT) pushing the boundaries of visual fidelity [23, 32, 20, 21, 42, 27, 2] and creative customizability [6, 15, 34, 9, 40]. Beyond artistic content creation, these models are increasingly envisioned as world simulators [12] for robotics [1] and interactive agents [39, 3, 8, 30]. In robotics research, video-generation-based data synthesis [47] has emerged as a powerful tool for policy pre-training and data enrichment [13, 48]. 
However, the transition from \"visually pleasing\" to \"functionally useful\""},{"citing_arxiv_id":"2604.19679","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-21T16:57:23+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Let $z_{\\text{aud}} = \\{z_{\\text{aud},1}, z_{\\text{aud},2}, \\ldots, z_{\\text{aud},k}\\}$ be the reference audio latent sequence from the Audio VAE, where $k$ is the number of latent tokens for the reference duration. A sequence of silence latents $z_{\\text{sil}}$ of length $t$ serves as placeholder sequence for generation. The acoustic MMCU is formulated as: $A = \\{z_{\\text{aud},1}, \\ldots, z_{\\text{aud},k}\\} + \\{z_{\\text{sil},1}, \\ldots, z_{\\text{sil},t}\\}$, $M_a = \\{0\\}^k + \\{1\\}^t$, (3) where $\\{0\\}^k$ and $\\{1\\}^t$ are constant sequences of zeros and ones with lengths $k$ and $t$. This ensures the model observes a reference audio segment to fix the speaker identity before the generative phase. 
While $z_{\\text{sil}}$ currently represents placeholder sequence, this formulation allows for future extensions using specific acoustic controls like pitch or energy contours."},{"citing_arxiv_id":"2604.02467","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation","primary_cat":"cs.CV","submitted_at":"2026-04-02T18:58:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Reinforcement-learning-based post-training has become a prominent paradigm for aligning generative models with human preferences. In image and video generation, methods such as RLHF, RLAIF, and DPO have improved motion fidelity and text-visual consistency by distilling human-like preferences into diffusion models [1,34,64]. Works like Control-A-Video [4] further demonstrate reward feedback learning for controllable video diffusion, using visual reward signals derived from rendered frames. Self-rewarding and AI-generated preference signals further reduce dependence on costly labels and help mitigate reward hacking [28,29,39]. 
Recent Vision-Language Models (VLMs) also advance video understanding with finer temporal reasoning and narrative coherence [38,49],"},{"citing_arxiv_id":"2603.29092","ref_index":7,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TrajectoryMover: Generative Movement of Object Trajectories in Videos","primary_cat":"cs.CV","submitted_at":"2026-03-31T00:15:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A synthetic data pipeline and fine-tuned video model enable generative editing to move object 3D trajectories in videos while keeping relative motion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 Conditional Video Generation and Editing Conditional video diffusion models extend base text- or image-conditioned architectures by incorporating auxiliary control signals. Inspired by the success of ControlNet [42] on controllable image generation, several approaches have introduced ControlNet-style hypernetworks to video synthesis [15,7,18,33,4,19,31]. These methods adapt temporal mechanisms to enable guidance via diverse visual signals, such as depth maps, edge maps, and camera parameters. Alternatively, other frameworks utilize source video input to facilitate video-to-video editing. 
These approaches generally address tasks such as targeted appearance editing [19,21,1,31], global stylization [40,25], and novel view synthesis [2,16,32,17,41]."},{"citing_arxiv_id":"2602.13669","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation","primary_cat":"cs.CV","submitted_at":"2026-02-14T08:32:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.18576","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment","primary_cat":"cs.RO","submitted_at":"2025-04-22T20:58:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.07598","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VACE: All-in-One Video Creation and Editing","primary_cat":"cs.CV","submitted_at":"2025-03-10T17:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VACE unifies 
reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.14803","ref_index":91,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations","primary_cat":"cs.CV","submitted_at":"2024-12-19T12:48:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.02101","ref_index":106,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CameraCtrl: Enabling Camera Control for Text-to-Video Generation","primary_cat":"cs.CV","submitted_at":"2024-04-02T16:52:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.17177","ref_index":186,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision 
Models","primary_cat":"cs.CV","submitted_at":"2024-02-27T03:30:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Horwitz, D. Valevski, A. R. Acha, Y. Matias, Y. Pritch, Y. Leviathan, and Y. Hoshen, \"Dreamix: Video diffusion models are general video editors,\" arXiv preprint arXiv:2302.01329, 2023. [185] J. H. Liew, H. Yan, J. Zhang, Z. Xu, and J. Feng, \"Magicedit: High-fidelity and temporally coherent video editing,\" arXiv preprint arXiv:2308.14749, 2023. [186] W. Chen, J. Wu, P. Xie, H. Wu, J. Li, X. Xia, X. Xiao, and L. Lin, \"Control-a-video: Controllable text-to-video generation with diffusion models,\" arXiv preprint arXiv:2305.13840, 2023. [187] W. Chai, X. Guo, G. Wang, and Y. Lu, \"Stablevideo: Text-driven consistency-aware diffusion video editing,\" in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp."},{"citing_arxiv_id":"2311.04145","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models","primary_cat":"cs.CV","submitted_at":"2023-11-07T17:16:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}