TIE derives a sinc-based interval encoding from Temporal Integrability and Duration Invariance principles, raising human-verified temporal constraint satisfaction from 77.34% to 96.03% while preserving visual quality in DiT models.
hub Canonical reference
Control-a-video: Controllable text-to-video generation with diffusion models
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A synthetic data pipeline and fine-tuned video model enable generative editing to move object 3D trajectories in videos while keeping relative motion.
VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.
A new framework factorizes weather video synthesis into semantic appearance anchoring, physics-informed Gaussian particle simulation under gravity/wind/turbulence, and geometry-grounded alignment to produce diverse realistic weather effects.
OmniDirector introduces a grid-based camera representation and hierarchical prompt agent for multi-shot camera cloning in video diffusion models trained on million-scale unpaired data.
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
KGEdit uses an ambiguity-aware knowledge graph and structured injection modules to improve semantic control and temporal consistency in training-free text-to-video diffusion models.
GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.
DriveCtrl is a depth-conditioned controllable framework that generates realistic driving videos from simulation while preserving annotations and scene dynamics.
DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.
I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.
Coherence-first rendering with 15 FPS anchors plus FSR4 upsampling to 30 FPS preserves scene geometry and identity longer than native 30 FPS generation across tested forest, sword, desert, and snow scenes, with LPIPS favoring the coherence branch.
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
citing papers explorer
-
DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.