A synthetic data pipeline and fine-tuned video model enable generative editing to move object 3D trajectories in videos while keeping relative motion.
hub Canonical reference
Control-a-video: Controllable text-to-video generation with diffusion models
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
KGEdit uses an ambiguity-aware knowledge graph and structured injection modules to improve semantic control and temporal consistency in training-free text-to-video diffusion models.
GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.
DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.
I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
citing papers explorer
-
TrajectoryMover: Generative Movement of Object Trajectories in Videos
A synthetic data pipeline and fine-tuned video model enable generative editing to move object 3D trajectories in videos while keeping relative motion.
-
VACE: All-in-One Video Creation and Editing
VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.
-
ReactiveGWM: Steering NPC in Reactive Game World Models
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
-
MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation
MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
KGEdit: Ambiguity-Aware Knowledge Graphs for Training-Free Precise Video Generation and Editing
KGEdit uses an ambiguity-aware knowledge graph and structured injection modules to improve semantic control and temporal consistency in training-free text-to-video diffusion models.
-
GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes
GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.
-
DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.
-
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.
-
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
- TIE: Time Interval Encoding for Video Generation over Events