Recognition: 2 theorem links
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
Pith reviewed 2026-05-13 02:00 UTC · model grok-4.3
The pith
A plug-and-play module adds precise camera pose control to existing text-to-video diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models.
What carries the argument
The plug-and-play camera pose control module, which takes parameterized camera trajectories as input and injects the corresponding signals into a frozen video diffusion model during generation.
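To make the plug-and-play pattern concrete, the sketch below shows the general shape of such a design in PyTorch: a small trainable camera encoder maps per-frame pose parameters to features that are added into the hidden states of a frozen backbone block. This is a minimal illustration of the pattern described above, not the paper's actual architecture; the module names, dimensions, and injection site are all assumptions.
```python
# Minimal PyTorch sketch of the plug-and-play pattern: a trainable camera
# encoder produces per-frame features that are added to the hidden states of a
# frozen video diffusion block. Names and sizes are illustrative only.
import torch
import torch.nn as nn


class CameraEncoder(nn.Module):
    """Maps a parameterized camera trajectory (F frames x pose_dim) to
    per-frame conditioning features matched to the backbone's hidden size."""

    def __init__(self, pose_dim: int = 12, hidden: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, poses: torch.Tensor) -> torch.Tensor:  # (B, F, pose_dim)
        return self.net(poses)                                # (B, F, hidden)


class ControlledBlock(nn.Module):
    """Wraps one frozen backbone block; only the camera pathway is trainable."""

    def __init__(self, frozen_block: nn.Module, hidden: int = 320):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad_(False)                   # base model stays untouched
        self.camera_proj = nn.Linear(hidden, hidden)  # trainable injection path

    def forward(self, h: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
        # h: (B, F, hidden) per-frame hidden states; cam: (B, F, hidden)
        return self.block(h + self.camera_proj(cam))


if __name__ == "__main__":
    frozen = nn.Linear(320, 320)       # stand-in for a frozen diffusion block
    block = ControlledBlock(frozen)
    cam_enc = CameraEncoder()
    poses = torch.randn(2, 16, 12)     # e.g. flattened 3x4 extrinsics per frame
    h = torch.randn(2, 16, 320)
    out = block(h, cam_enc(poses))
    print(out.shape)                   # torch.Size([2, 16, 320])
```
Because only the camera encoder and the injection projection carry gradients, the base model's weights are untouched during training, which is the property the claim rests on.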
If this is right
- Precise camera control becomes available for multiple different video diffusion models without retraining them.
- Controllability and generalization improve when the training data includes wide ranges of camera paths and visual styles matching the base model.
- Users can combine text prompts with explicit camera pose sequences to produce videos that follow chosen cinematic movements (see the trajectory sketch after this list).
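As an illustration of what an "explicit camera pose sequence" might look like on the user side, the sketch below builds a short pan-and-truck trajectory as a stack of per-frame [R | t] extrinsics. The helper name and the 3x4 convention are assumptions for illustration; actual interfaces vary by implementation.
```python
# Illustrative only: one way a user-specified camera trajectory could be
# constructed as per-frame extrinsics (a slow rightward truck with a gentle pan).
import numpy as np


def pan_and_truck(num_frames: int = 16, pan_deg: float = 20.0, truck_m: float = 0.5):
    poses = []
    for i in range(num_frames):
        s = i / max(num_frames - 1, 1)
        yaw = np.deg2rad(pan_deg * s)                  # rotate about the y axis
        R = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                      [0.0,         1.0, 0.0],
                      [-np.sin(yaw), 0.0, np.cos(yaw)]])
        t = np.array([truck_m * s, 0.0, 0.0])          # slide sideways
        poses.append(np.hstack([R, t[:, None]]))       # 3x4 extrinsic [R | t]
    return np.stack(poses)                             # (num_frames, 3, 4)


trajectory = pan_and_truck()
print(trajectory.shape)  # (16, 3, 4)
```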
Where Pith is reading between the lines
- The modular design opens the possibility of adding further independent control signals, such as object motion or lighting, on top of the same base model.
- This separation of concerns could support interactive applications where users adjust camera paths after an initial generation pass.
- The emphasis on dataset camera diversity suggests that future work might systematically catalog and release camera-annotated video collections to boost similar control methods.
Load-bearing premise
Videos with diverse camera distributions and appearance similar to the base model can be collected in sufficient quantity, and the added control module will not reduce the base model's visual quality.
What would settle it
A side-by-side comparison where the generated video frames do not exhibit the exact camera motion specified in the input trajectory, or where image quality metrics fall below those of the unmodified base model.
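One way to operationalize this test: recover per-frame camera poses from the generated clip with an off-the-shelf structure-from-motion or pose-estimation tool (assumed upstream and not shown), then score agreement with the specified trajectory via a geodesic rotation error and a translation error. A minimal sketch under those assumptions:
```python
# Sketch of a trajectory-fidelity check. Pose estimation from generated frames
# is assumed to have happened upstream; this only compares estimated poses
# against the specified trajectory.
import numpy as np


def rotation_error_deg(R_target: np.ndarray, R_est: np.ndarray) -> float:
    """Geodesic distance on SO(3) between two 3x3 rotation matrices, in degrees."""
    cos = (np.trace(R_target.T @ R_est) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def trajectory_errors(target: np.ndarray, est: np.ndarray):
    """target, est: (F, 3, 4) stacks of [R | t] extrinsics for F frames."""
    rot = [rotation_error_deg(a[:, :3], b[:, :3]) for a, b in zip(target, est)]
    trans = [float(np.linalg.norm(a[:, 3] - b[:, 3])) for a, b in zip(target, est)]
    return float(np.mean(rot)), float(np.mean(trans))


# Example: comparing a trajectory against itself gives (0.0, 0.0).
identity = np.tile(np.hstack([np.eye(3), np.zeros((3, 1))]), (16, 1, 1))
print(trajectory_errors(identity, identity))
```
Large average errors on the first quantity, or quality metrics falling below the unmodified base model, would be the failure modes described above.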
read the original abstract
Controllability plays a crucial role in video generation, as it allows users to create and edit content more precisely. Existing models, however, lack control of camera pose that serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CameraCtrl, a plug-and-play camera pose control module for text-to-video diffusion models. It proposes a camera trajectory parameterization, trains the module atop frozen base video diffusion models, and conducts a dataset study claiming that videos with diverse camera distributions and similar appearance to the base model improve controllability and generalization. Experiments are said to demonstrate accurate camera control across different video generation models from text and pose inputs.
Significance. If the central claims hold, this would represent a meaningful advance in controllable video generation by adding precise cinematic camera control without full model retraining. The plug-and-play design and dataset ablation study are positive elements that could facilitate adoption if supported by rigorous evidence of preserved base-model quality.
major comments (2)
- [Abstract] The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred. (A minimal sketch of the underlying Fréchet computation follows this list.)
- [Abstract, dataset study] The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.
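For the FVD-style comparison the first comment asks for, the metric reduces to a Fréchet distance between Gaussian fits of clip-level features from the base model and from the module-augmented model; the feature extractor itself (an I3D network in standard FVD) is assumed upstream and not shown here. A minimal sketch of that computation:
```python
# Minimal sketch of the Frechet distance underlying FVD-style comparisons.
# Feature extraction (e.g. I3D clip embeddings for the base and the
# module-augmented model) is assumed to have produced the two arrays below.
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """feats_*: (num_clips, feat_dim) arrays of clip-level features."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))


rng = np.random.default_rng(0)
base = rng.normal(size=(256, 64))              # stand-ins for real feature sets
augmented = rng.normal(loc=0.05, size=(256, 64))
print(frechet_distance(base, augmented))
```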
minor comments (1)
- The camera trajectory parameterization is described only at a high level; a dedicated subsection with explicit equations for pose encoding and injection into the diffusion process would improve reproducibility (one candidate form is sketched below).
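As one plausible form such a subsection could take, a common pose encoding in camera-conditioned diffusion work is the per-pixel Plücker ray embedding; whether this matches the paper's exact parameterization should be checked against the text.
```latex
% Per-pixel ray parameterization of a camera with intrinsics K, world-space
% rotation R, and camera center o. For each pixel (u, v):
\[
  \mathbf{d}_{u,v} \;=\; \frac{R\,K^{-1}(u,\,v,\,1)^{\top}}
                               {\lVert R\,K^{-1}(u,\,v,\,1)^{\top} \rVert},
  \qquad
  \mathbf{p}_{u,v} \;=\; \bigl(\mathbf{o} \times \mathbf{d}_{u,v},\; \mathbf{d}_{u,v}\bigr) \in \mathbb{R}^{6},
\]
% giving a six-channel, pixel-aligned map per frame that a control module can
% consume alongside the latent video features.
```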
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We address the major comments below and have revised the manuscript accordingly to strengthen the presentation of our results.
read point-by-point responses
-
Referee: [Abstract] The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred.
Authors: We agree that including quantitative metrics in the abstract would better support this central claim. In the revised manuscript, we will update the abstract to reference the FVD and CLIP score comparisons from our experiments, which demonstrate that the base model quality is largely preserved after inserting and training the control module. These metrics are reported in detail in Section 4.1 of the paper. revision: yes
-
Referee: [Abstract, dataset study] The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.
Authors: The dataset ablation study with quantitative results, including controllability scores and error analysis across different datasets, is provided in Section 4.3 with supporting tables. To address this comment, we will revise the abstract to more explicitly summarize the key quantitative findings from this study, such as improved generalization on diverse camera trajectories. This will help readers connect the claim to the evidence without requiring them to immediately consult the full text. revision: yes
Circularity Check
No circularity in claimed derivation or results
full rationale
The paper presents an empirical engineering contribution: a trainable plug-and-play control module inserted into a frozen video diffusion backbone and trained via standard supervised learning on external video datasets. No equations, predictions, or first-principles claims are offered that reduce to fitted parameters or self-citations by construction. The central statements (accurate camera control, dataset effects on generalization) are supported by experimental comparisons rather than definitional or self-referential loops. This is the common case of a self-contained applied ML paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: A separate control module can be trained to steer camera pose while leaving the base video diffusion model's parameters untouched.
- Domain assumption: Dataset characteristics (diverse camera distributions and appearance similarity to the base model) directly determine controllability and generalization.
Forward citations
Cited by 31 Pith papers
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
-
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
-
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
Vista4D: Video Reshooting with 4D Point Clouds
Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
Lyra 2.0: Explorable Generative 3D Worlds
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
-
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
-
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...
-
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
HVG-3D uses a 3D-aware diffusion architecture with ControlNet to synthesize high-fidelity hand-object interaction videos from 3D control signals, achieving state-of-the-art spatial fidelity and temporal coherence on t...
-
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.