CameraCtrl: Enabling Camera Control for Text-to-Video Generation
Pith reviewed 2026-05-13 02:00 UTC · model grok-4.3
The pith
A plug-and-play module adds precise camera pose control to existing text-to-video diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models.
What carries the argument
The plug-and-play camera pose control module, which takes parameterized camera trajectories as input and injects the corresponding signals into a frozen video diffusion model during generation.
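As a rough illustration of the plug-and-play idea, the toy sketch below trains a small add-on while the base stays frozen. Everything in it is hypothetical: `FrozenBase`, `CameraAdapter`, the scalar weights and the squared-error loss are stand-ins chosen for brevity, not the paper's architecture or training objective.

```python
from dataclasses import dataclass


@dataclass
class FrozenBase:
    """Stand-in for a pretrained model; its weight is never updated."""
    weight: float = 2.0

    def __call__(self, x: float) -> float:
        return self.weight * x


@dataclass
class CameraAdapter:
    """Hypothetical add-on mapping a camera feature to a residual."""
    weight: float = 0.0  # zero init: the adapter starts as a no-op

    def __call__(self, cam: float) -> float:
        return self.weight * cam


def forward(base, adapter, x, cam):
    # The camera signal is injected as an additive residual.
    return base(x) + adapter(cam)


def train_adapter(base, adapter, data, lr=0.05, steps=200):
    """Per-sample gradient descent on squared error, adapter weight only."""
    for _ in range(steps):
        for x, cam, target in data:
            err = forward(base, adapter, x, cam) - target
            adapter.weight -= lr * 2.0 * err * cam  # base.weight is untouched
    return adapter


# Toy data whose camera-dependent part is 0.5 * cam (an invented value).
samples = [(1.0, 1.0), (0.5, -1.0), (2.0, 2.0)]
data = [(x, cam, 2.0 * x + 0.5 * cam) for x, cam in samples]
base, adapter = FrozenBase(), CameraAdapter()
train_adapter(base, adapter, data)
```

Because the adapter starts at zero, the combined model initially reproduces the base output exactly, which is one simple way an add-on can avoid disturbing a pretrained model at the start of training.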
If this is right
- Precise camera control becomes available for multiple different video diffusion models without retraining them.
- Controllability and generalization improve when the training data includes wide ranges of camera paths and visual styles matching the base model.
- Users can combine text prompts with explicit camera pose sequences to produce videos that follow chosen cinematic movements.
Where Pith is reading between the lines
- The modular design opens the possibility of adding further independent control signals, such as object motion or lighting, on top of the same base model.
- This separation of concerns could support interactive applications where users adjust camera paths after an initial generation pass.
- The emphasis on dataset camera diversity suggests that future work might systematically catalog and release camera-annotated video collections to boost similar control methods.
Load-bearing premise
Two things must hold: videos with diverse camera distributions and an appearance similar to the base model's can be collected in sufficient quantity, and the added control module does not degrade the base model's visual quality.
What would settle it
A side-by-side comparison where the generated video frames do not exhibit the exact camera motion specified in the input trajectory, or where image quality metrics fall below those of the unmodified base model.
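One way to make the camera-motion half of this test concrete is to compare the specified camera rotations against rotations estimated from the generated frames. The sketch below computes the geodesic angle between two rotation matrices. The function names are illustrative, and how the estimated poses are obtained (for example from a structure-from-motion tool) is assumed, not something this page specifies.

```python
import math
from statistics import fmean


def rotation_error_deg(r_target, r_estimated):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_t^T R_e) equals the sum of elementwise products.
    trace = sum(r_target[i][j] * r_estimated[i][j]
                for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def mean_rotation_error_deg(targets, estimates):
    """Average per-frame rotation error over a trajectory."""
    return fmean(rotation_error_deg(t, e) for t, e in zip(targets, estimates))


def yaw_matrix(deg):
    """Rotation about the y axis, used here only to build toy poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Toy trajectory: the second frame was asked to pan 10 degrees but panned 20.
specified = [yaw_matrix(0), yaw_matrix(10)]
recovered = [yaw_matrix(0), yaw_matrix(20)]
error = mean_rotation_error_deg(specified, recovered)
```

A translation error would need a separate measure, since estimated camera positions are typically known only up to scale.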
Original abstract
Controllability plays a crucial role in video generation, as it allows users to create and edit content more precisely. Existing models, however, lack control of camera pose that serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CameraCtrl, a plug-and-play camera pose control module for text-to-video diffusion models. It proposes a camera trajectory parameterization, trains the module atop frozen base video diffusion models, and conducts a dataset study claiming that videos with diverse camera distributions and similar appearance to the base model improve controllability and generalization. Experiments are said to demonstrate accurate camera control across different video generation models from text and pose inputs.
Significance. If the central claims hold, this would represent a meaningful advance in controllable video generation by adding precise cinematic camera control without full model retraining. The plug-and-play design and dataset ablation study are positive elements that could facilitate adoption if supported by rigorous evidence of preserved base-model quality.
major comments (2)
- [Abstract] The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred.
- [Abstract / dataset study] The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.
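The before/after quality comparison requested in the first comment is usually reported with a Fréchet-style distance such as FVD. The sketch below shows the underlying Fréchet distance between two Gaussians fitted to feature vectors, under the simplifying assumption of diagonal covariances. Real FVD uses features from a pretrained video network and full covariance matrices, and the feature lists here are made-up placeholders.

```python
import math
from statistics import fmean, pvariance


def diag_gaussian_stats(features):
    """Per-dimension mean and population variance of a list of vectors."""
    dims = list(zip(*features))
    return [fmean(d) for d in dims], [pvariance(d) for d in dims]


def frechet_distance_diag(feats_a, feats_b):
    """Fréchet distance between diagonal Gaussians fitted to two feature sets.

    With diagonal covariances the trace term reduces to a sum of squared
    differences between per-dimension standard deviations.
    """
    mu_a, var_a = diag_gaussian_stats(feats_a)
    mu_b, var_b = diag_gaussian_stats(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_a, var_b))
    return mean_term + cov_term


# Placeholder features standing in for "base model" vs "base + module" videos.
base_feats = [[0.0, 0.0], [2.0, 0.0]]
ctrl_feats = [[1.0, 0.0], [3.0, 0.0]]
distance = frechet_distance_diag(base_feats, ctrl_feats)
```

A distance near zero between the two feature sets would be consistent with the module leaving the output distribution largely unchanged; it would not by itself show that camera control is accurate.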
minor comments (1)
- The camera trajectory parameterization is described at a high level; a dedicated subsection with explicit equations for pose encoding and injection into the diffusion process would improve reproducibility.
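For a concrete picture of what such a subsection might contain, one ray-based pose encoding used in camera-conditioned generation is the Plücker embedding, which represents each pixel's viewing ray by its moment and direction, (o × d, d). The sketch below computes it for a single pixel under a pinhole camera model. Whether this matches the paper's exact equations is not stated on this page, and all function and parameter names are illustrative.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(u, v, fx, fy, cx, cy, rot_c2w, center):
    """Return the 6-D Plücker embedding (o x d, d) for pixel (u, v).

    rot_c2w is a 3x3 camera-to-world rotation (nested lists) and
    center is the camera position o in world coordinates.
    """
    # Back-project the pixel to a ray direction in camera coordinates.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into world coordinates and normalise to unit length.
    d_world = [sum(rot_c2w[i][j] * d_cam[j] for j in range(3))
               for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    d = tuple(c / norm for c in d_world)
    # The moment o x d encodes where the ray sits in space.
    return cross(center, d) + d


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
embedding = plucker_ray(32, 32, 50.0, 50.0, 32.0, 32.0, identity, (1.0, 0.0, 0.0))
```

Evaluating this for every pixel of every frame yields a per-frame tensor with six channels and the same spatial layout as the image, which is one reason ray encodings are convenient inputs for a convolutional or attention-based control module.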
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We address the major comments below and will revise the manuscript accordingly to strengthen the presentation of our results.
-
Referee: [Abstract] The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred.
Authors: We agree that including quantitative metrics in the abstract would better support this central claim. In the revised manuscript, we will update the abstract to reference the FVD and CLIP score comparisons from our experiments, which demonstrate that the base model quality is largely preserved after inserting and training the control module. These metrics are reported in detail in Section 4.1 of the paper.
Revision: yes
-
Referee: [Abstract / dataset study] The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.
Authors: The dataset ablation study with quantitative results, including controllability scores and error analysis across different datasets, is provided in Section 4.3 with supporting tables. To address this comment, we will revise the abstract to summarize the key quantitative findings from this study more explicitly, such as improved generalization on diverse camera trajectories. This will help readers connect the claim to the evidence without having to consult the full text first.
Revision: yes
Circularity Check
No circularity in claimed derivation or results
The paper presents an empirical engineering contribution: a trainable plug-and-play control module inserted into a frozen video diffusion backbone and trained via standard supervised learning on external video datasets. No equations, predictions, or first-principles claims are offered that reduce to fitted parameters or self-citations by construction. The central statements (accurate camera control, dataset effects on generalization) are supported by experimental comparisons rather than definitional or self-referential loops. This is the common case of a self-contained applied ML paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: A separate control module can be trained to steer camera pose while leaving the base video diffusion model parameters untouched.
- Domain assumption: Dataset characteristics (diverse camera distributions and appearance similarity to the base model) directly determine controllability and generalization.
Forward citations
Cited by 54 Pith papers
-
Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning
PREX decomposes target 4D video volumes into Preserve, Reveal, and Expand roles with a region-aware adapter on a frozen diffusion backbone, trained via proxy tasks, and introduces the PREBench benchmark to reduce regi...
-
DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4...
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
-
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
-
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.
-
Setting the Stage: Text-Driven Scene-Consistent Image Generation
A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
-
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.
-
GimbalDiffusion: Gravity-Aware Camera Control for Video Generation
GimbalDiffusion adds gravity-referenced absolute camera control and null-pitch conditioning to text-to-video diffusion models, trained on full-sphere panoramic data, to support extreme trajectories and reduce prompt e...
-
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
-
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
-
ReactiveGWM: Steering NPC in Reactive Game World Models
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
-
UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...
-
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control augments hard-replacement guidance with block-conditional pseudo-Gibbs refinement on unobserved latent sites and adaptive 3D patch freezing to achieve superior FVD on RealEstate10K and DAVIS.
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
Vista4D: Video Reshooting with 4D Point Clouds
Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
Lyra 2.0: Explorable Generative 3D Worlds
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
-
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
-
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...
-
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
HVG-3D uses a 3D-aware diffusion architecture with ControlNet to synthesize high-fidelity hand-object interaction videos from 3D control signals, achieving state-of-the-art spatial fidelity and temporal coherence on t...
-
Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians
A unified single-pass framework using dynamic 3D Gaussians generates temporally consistent camera-controlled videos from a single image.
-
Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians
Pixel-to-4D builds a dynamic 3D Gaussian representation from one image and samples object motion in a single forward pass to produce camera-controlled videos with claimed state-of-the-art quality and speed on KITTI, W...
-
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Fine-tuning text-to-video models on sparse low-quality synthetic data for physical camera controls outperforms fine-tuning on photorealistic data.
-
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models
A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
-
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
-
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
-
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.
-
EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation
EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 uses Flow-GRPO reinforcement learning and a new text dataset to enforce 3D consistency in text-to-video generation while keeping the original model's visual quality.
-
Matrix-game 2.0: An open-source real-time and streaming interactive world model
Matrix-Game 2.0 introduces a scalable data pipeline, action-injection module, and few-step distillation to enable real-time streaming video generation at 25 FPS from game-engine interactions, with open-sourced weights...
-
Geometry-aware 4D Video Generation for Robot Manipulation
A geometry-aware 4D video generation model trained with cross-view pointmap alignment to produce spatio-temporally consistent future videos from novel viewpoints for robot manipulation.
-
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
Reference graph
Works this paper leans on
-
[3]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Glide: Towards photorealistic image generation and editing with text-guided diffusion models , author=. arXiv preprint arXiv:2112.10741 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Hierarchical text-conditional image generation with clip latents , author=. arXiv preprint arXiv:2204.06125 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Advances in Neural Information Processing Systems , volume=
Photorealistic text-to-image diffusion models with deep language understanding , author=. Advances in Neural Information Processing Systems , volume=
-
[6]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[7]
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
ediffi: Text-to-image diffusion models with an ensemble of expert denoisers , author=. arXiv preprint arXiv:2211.01324 , year=
work page internal anchor Pith review arXiv
-
[8]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Vector quantized diffusion model for text-to-image synthesis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[9]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Multi-concept customization of text-to-image diffusion , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[11]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Versatile diffusion: Text, images and variations all in one diffusion model , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[12]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[13]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Advancing high-resolution video-language representation with large-scale video transcriptions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[14]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Frozen in time: A joint video and image encoder for end-to-end retrieval , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[16]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Align your latents: High-resolution video synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[17]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[22]
Dreampose: Fashion image-to-video synthesis via stable diffusion,
Dreampose: Fashion image-to-video synthesis via stable diffusion , author=. arXiv preprint arXiv:2304.06025 , year=
-
[25]
Advances in Neural Information Processing Systems , volume=
Light field networks: Neural scene representations with single-evaluation rendering , author=. Advances in Neural Information Processing Systems , volume=
-
[26]
arXiv preprint arXiv:2304.13681 , year=
Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation , author=. arXiv preprint arXiv:2304.13681 , year=
-
[27]
arXiv preprint arXiv:2312.04551 , year=
Free3D: Consistent Novel View Synthesis without 3D Representation , author=. arXiv preprint arXiv:2312.04551 , year=
-
[29]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Objaverse: A universe of annotated 3d objects , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[30]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Mvimgnet: A large-scale dataset of multi-view images , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[32]
Advances in Neural Information Processing Systems , volume=
Videocomposer: Compositional video synthesis with motion controllability , author=. Advances in Neural Information Processing Systems , volume=
-
[34]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[35]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Infinite nature: Perpetual view generation of natural scenes from a single image , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[36]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Scannet: Richly-annotated 3d reconstructions of indoor scenes , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[37]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Learning the depths of moving people by watching frozen people , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[38]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
DreamPose: Fashion Video Synthesis with Stable Diffusion , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[39]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[41]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation , author=. 2023 , eprint=
work page 2023
-
[44]
Advances in Neural Information Processing Systems , volume=
Denoising diffusion probabilistic models , author=. Advances in Neural Information Processing Systems , volume=
-
[45]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[46]
MagicVideo: Efficient Video Generation With Latent Diffusion Models
Magicvideo: Efficient video generation with latent diffusion models , author=. arXiv preprint arXiv:2211.11018 , year=
work page internal anchor Pith review arXiv
- [52]
-
[54]
arXiv preprint arXiv:2310.08465 , year=
Motiondirector: Motion customization of text-to-video diffusion models , author=. arXiv preprint arXiv:2310.08465 , year=
-
[60]
IEEE International Conference on Computer Vision (ICCV) , year=
Text2video-zero: Text-to-image diffusion models are zero-shot video generators , author=. IEEE International Conference on Computer Vision (ICCV) , year=
-
[61]
arXiv preprint arXiv:2305.04001 , year=
AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion , author=. arXiv preprint arXiv:2305.04001 , year=
-
[62]
arXiv preprint arXiv:2304.08551 , year=
Generative Disco: Text-to-Video Generation for Music Visualization , author=. arXiv preprint arXiv:2304.08551 , year=
-
[63]
Structure-from-Motion Revisited , booktitle=
Sch\". Structure-from-Motion Revisited , booktitle=
-
[64]
MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images , author=. arXiv preprint arXiv:2306.07257 , year=
-
[65]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Conditional Image-to-Video Generation with Latent Flow Diffusion Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
- [67]
-
[68]
SG 161222 civitai , title =
-
[69]
SO3 roration distance , howpublished =
Boris Belousov. SO3 roration distance , howpublished =
-
[71]
LAVIS: A One-stop Library for Language-Vision Intelligence
Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C.H. Hoi. LAVIS: A One-stop Library for Language-Vision Intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 2023
-
[73]
Seine: Short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations.
-
[74]
Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
-
[76]
FVD: A new metric for video generation. https://openreview.net/forum?id=rylgEULtdN
-
[77]
Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021
-
[79]
Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems.
-
[80]
Compositional 3D Scene Generation using Locally Conditioned Diffusion. arXiv preprint.
-
[81]
Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
-
[83]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. 2024
-
[84]
Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part II, 2020
-
[85]
CAT3D: Create Anything in 3D with Multi-View Diffusion Models. Advances in Neural Information Processing Systems.
-
[87]
Training-free Camera Control for Video Generation. arXiv preprint arXiv:2406.10126, 2024
-
[93]
Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
-
[95]
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024
-
[96]
Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781, 2024
-
[97]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728--1738, 2021
-
[98]
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024
-
[99]
Boris Belousov. SO3 rotation distance. http://www.boris-belousov.net/2016/12/01/quat-dist/
-
[100]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a
-
[101]
Align your latents: High-resolution video synthesis with latent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563--22575, 2023b
-
[102]
BradCatt. Toonyou. https://civitai.com/models/30240/toonyou
-
[103]
Video generation models as world simulators
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. URL https://openai.com/research/video-generation-models-as-world-...
-
[104]
Videocrafter1: Open diffusion models for high-quality video generation
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation, 2023a
-
[105]
Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023b
-
[106]
Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023c
-
[107]
Seine: Short-to-long video diffusion model for generative transition and prediction
Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations, 2023d
-
[108]
Boosting camera motion control for video diffusion transformers
Soon Yau Cheong, Duygu Ceylan, Armin Mustafa, Andrew Gilbert, and Chun-Hao Paul Huang. Boosting camera motion control for video diffusion transformers. arXiv preprint arXiv:2410.10802, 2024
-
[109]
SG 161222 civitai. Realistic vision. https://civitai.com/models/4201/realistic-vision-v60-b1
-
[110]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13142--13153, 2023
-
[111]
Structure and content-guided video synthesis with diffusion models
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7346--7356, 2023
-
[112]
Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023a
-
[113]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023b
-
[114]
Photorealistic video generation with diffusion models
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023
-
[115]
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022
-
[116]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022
-
[117]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840--6851, 2020
-
[118]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a
-
[119]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022b
-
[120]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022