pith. machine review for the scientific record.

arxiv: 2404.02101 · v2 · submitted 2024-04-02 · 💻 cs.CV


CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Bo Dai, Ceyuan Yang, Gordon Wetzstein, Hao He, Hongsheng Li, Yinghao Xu, Yuwei Guo

Pith reviewed 2026-05-13 02:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video generation · camera pose control · video diffusion models · plug-and-play module · camera trajectory · cinematic video · controllable generation

The pith

A plug-and-play module adds precise camera pose control to existing text-to-video diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CameraCtrl to give text-to-video AI models control over camera movements such as pans, tilts, and zooms. Current diffusion-based video generators produce content from text but cannot follow user-specified camera paths, which limits their use for narrative or cinematic results. The solution adds a separate control module trained on video data that already contains varied camera trajectories while keeping the original model weights unchanged. Experiments show that datasets with wide camera variety and visual styles close to the base model improve both accuracy and how well the control transfers to new prompts. This setup lets users supply both text and a camera trajectory to generate videos with directed motion.
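The "text plus a camera trajectory" input described above can be made concrete with a toy pose sequence. A minimal sketch, assuming (as is common in this line of work, though not stated in this review) that a trajectory is a per-frame camera-to-world matrix normalized against the first frame:

```python
import numpy as np

# Illustrative only: represent a camera trajectory as one 4x4
# camera-to-world matrix per frame, re-expressed relative to frame 0
# so the model always sees motion, not absolute placement.

def relative_trajectory(c2w):
    """c2w: (T, 4, 4) camera-to-world matrices -> poses relative to frame 0."""
    inv0 = np.linalg.inv(c2w[0])
    return np.stack([inv0 @ m for m in c2w])

# A simple 3-frame dolly-forward trajectory along +z.
T = 3
traj = np.tile(np.eye(4), (T, 1, 1))
traj[:, 2, 3] = np.arange(T) * 0.1  # translate 0.0, 0.1, 0.2 along z

rel = relative_trajectory(traj)
# Frame 0 becomes the identity; later frames keep their relative offsets.
```

A sequence like `rel`, paired with a text prompt, is the kind of two-part input the review describes.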

Core claim

We introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models.

What carries the argument

The plug-and-play camera pose control module, which takes parameterized camera trajectories as input and injects the corresponding signals into a frozen video diffusion model during generation.
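The mechanism described above can be sketched in miniature: a frozen transformation plus a trainable residual driven only by the pose input. This is an illustrative toy, not the paper's architecture; the names, shapes, and zero-initialization trick are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_block(x, W):
    """Stand-in for one frozen layer of the video diffusion backbone."""
    return np.tanh(x @ W)

class CameraAdapter:
    """Trainable module mapping per-frame pose features to residuals."""
    def __init__(self, pose_dim, feat_dim):
        # Zero-init: before training, the adapter contributes nothing,
        # so the base model's behavior is exactly preserved.
        self.W_pose = np.zeros((pose_dim, feat_dim))  # the only trainable weights

    def __call__(self, poses):
        return poses @ self.W_pose

frames, feat_dim, pose_dim = 4, 8, 12
W_frozen = rng.normal(size=(feat_dim, feat_dim))  # backbone weights, never updated
x = rng.normal(size=(frames, feat_dim))           # latent video features
poses = rng.normal(size=(frames, pose_dim))       # flattened per-frame pose encoding

adapter = CameraAdapter(pose_dim, feat_dim)
h = frozen_block(x, W_frozen) + adapter(poses)    # inject the control signal
```

Only `adapter.W_pose` would receive gradients; the untrained adapter is a no-op, which is one way a module can be added while "leaving other modules of the base model untouched."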

If this is right

  • Precise camera control becomes available for multiple different video diffusion models without retraining them.
  • Controllability and generalization improve when the training data includes wide ranges of camera paths and visual styles matching the base model.
  • Users can combine text prompts with explicit camera pose sequences to produce videos that follow chosen cinematic movements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The modular design opens the possibility of adding further independent control signals, such as object motion or lighting, on top of the same base model.
  • This separation of concerns could support interactive applications where users adjust camera paths after an initial generation pass.
  • The emphasis on dataset camera diversity suggests that future work might systematically catalog and release camera-annotated video collections to boost similar control methods.

Load-bearing premise

Videos with diverse camera distributions and appearance similar to the base model can be collected in sufficient quantity and the added control module will not reduce the base model's visual quality.

What would settle it

A side-by-side comparison where the generated video frames do not exhibit the exact camera motion specified in the input trajectory, or where image quality metrics fall below those of the unmodified base model.

read the original abstract

Controllability plays a crucial role in video generation, as it allows users to create and edit content more precisely. Existing models, however, lack control of camera pose that serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CameraCtrl, a plug-and-play camera pose control module for text-to-video diffusion models. It proposes a camera trajectory parameterization, trains the module atop frozen base video diffusion models, and conducts a dataset study claiming that videos with diverse camera distributions and similar appearance to the base model improve controllability and generalization. Experiments are said to demonstrate accurate camera control across different video generation models from text and pose inputs.

Significance. If the central claims hold, this would represent a meaningful advance in controllable video generation by adding precise cinematic camera control without full model retraining. The plug-and-play design and dataset ablation study are positive elements that could facilitate adoption if supported by rigorous evidence of preserved base-model quality.

major comments (2)
  1. [Abstract] The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred.
  2. [Abstract / dataset study] The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.
minor comments (1)
  1. The camera trajectory parameterization is described at a high level; a dedicated subsection with explicit equations for pose encoding and injection into the diffusion process would improve reproducibility.
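For illustration, such a subsection might spell out a per-pixel ray (Plücker-style) encoding, a common choice in camera-conditioned generation; the notation below is a sketch, not taken from the paper's text:

$$ d_{u,v} = \frac{R\,K^{-1}(u, v, 1)^{\top}}{\lVert R\,K^{-1}(u, v, 1)^{\top} \rVert}, \qquad p_{u,v} = \big(o \times d_{u,v},\; d_{u,v}\big) \in \mathbb{R}^{6}, $$

where $K$ is the camera intrinsics matrix, $(R, o)$ the camera-to-world rotation and camera center for a frame, and the per-pixel embeddings $p_{u,v}$ form a pose feature map that a control module could consume.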

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address the major comments below and have revised the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred.

    Authors: We agree that including quantitative metrics in the abstract would better support this central claim. In the revised manuscript, we will update the abstract to reference the FVD and CLIP score comparisons from our experiments, which demonstrate that the base model quality is largely preserved after inserting and training the control module. These metrics are reported in detail in Section 4.1 of the paper. revision: yes

  2. Referee: [Abstract / dataset study] The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.

    Authors: The dataset ablation study with quantitative results, including controllability scores and error analysis across different datasets, is provided in Section 4.3 with supporting tables. To address this comment, we will revise the abstract to more explicitly summarize the key quantitative findings from this study, such as improved generalization on diverse camera trajectories. This will help readers connect the claim to the evidence without requiring them to immediately consult the full text. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation or results

full rationale

The paper presents an empirical engineering contribution: a trainable plug-and-play control module inserted into a frozen video diffusion backbone and trained via standard supervised learning on external video datasets. No equations, predictions, or first-principles claims are offered that reduce to fitted parameters or self-citations by construction. The central statements (accurate camera control, dataset effects on generalization) are supported by experimental comparisons rather than definitional or self-referential loops. This is the common case of a self-contained applied ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that camera pose can be decoupled from appearance and content generation in diffusion models, and that training data properties (camera diversity and appearance similarity) causally improve control without side effects.

axioms (2)
  • domain assumption A separate control module can be trained to steer camera pose while leaving the base video diffusion model parameters untouched.
    Stated in the abstract as the training strategy for the plug-and-play module.
  • domain assumption Dataset characteristics (diverse camera distributions and appearance similarity to base model) directly determine controllability and generalization.
    Presented as the outcome of the comprehensive study on training datasets.

pith-pipeline@v0.9.0 · 5464 in / 1242 out tokens · 41730 ms · 2026-05-13T02:00:07.850613+00:00 · methodology

discussion (0)


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

  2. MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

    cs.CV 2026-05 unverdicted novelty 7.0

    MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...

  3. $h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.

  4. Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

    cs.CV 2026-04 unverdicted novelty 7.0

    Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.

  5. WorldMark: A Unified Benchmark Suite for Interactive Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

  6. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  7. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  8. Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

  9. Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.

  10. Novel View Synthesis as Video Completion

    cs.CV 2026-04 unverdicted novelty 7.0

    Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

  11. MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.

  12. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  13. OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

  14. UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...

  15. RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

    cs.CV 2026-05 unverdicted novelty 6.0

    RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.

  16. Vista4D: Video Reshooting with 4D Point Clouds

    cs.CV 2026-04 unverdicted novelty 6.0

    Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.

  17. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  18. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  19. From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

    cs.CV 2026-04 unverdicted novelty 6.0

    Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.

  20. Lyra 2.0: Explorable Generative 3D Worlds

    cs.CV 2026-04 unverdicted novelty 6.0

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  21. Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.

  22. Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

    cs.CV 2026-04 unverdicted novelty 6.0

    A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.

  23. SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

    cs.CV 2026-04 unverdicted novelty 6.0

    SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...

  24. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  25. SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.

  26. VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

    cs.CV 2026-04 conditional novelty 6.0

    VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...

  27. HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis

    cs.CV 2026-03 unverdicted novelty 6.0

    HVG-3D uses a 3D-aware diffusion architecture with ControlNet to synthesize high-fidelity hand-object interaction videos from 3D control signals, achieving state-of-the-art spatial fidelity and temporal coherence on t...

  28. ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    cs.CV 2024-09 unverdicted novelty 6.0

    ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.

  29. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

  30. World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 4.0

    World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.

  31. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · cited by 30 Pith papers · 17 internal anchors

  1. [3]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Glide: Towards photorealistic image generation and editing with text-guided diffusion models , author=. arXiv preprint arXiv:2112.10741 , year=

  2. [4]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Hierarchical text-conditional image generation with clip latents , author=. arXiv preprint arXiv:2204.06125 , year=

  3. [5]

    Advances in Neural Information Processing Systems , volume=

    Photorealistic text-to-image diffusion models with deep language understanding , author=. Advances in Neural Information Processing Systems , volume=

  4. [6]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  5. [7]

    ediff-i: Text-to-image diffusion models with ensemble of expert denoisers

    ediffi: Text-to-image diffusion models with an ensemble of expert denoisers , author=. arXiv preprint arXiv:2211.01324 , year=

  6. [8]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Vector quantized diffusion model for text-to-image synthesis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  7. [9]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Multi-concept customization of text-to-image diffusion , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  8. [11]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Versatile diffusion: Text, images and variations all in one diffusion model , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  9. [12]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  10. [13]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Advancing high-resolution video-language representation with large-scale video transcriptions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  11. [14]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Frozen in time: A joint video and image encoder for end-to-end retrieval , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  12. [16]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Align your latents: High-resolution video synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  13. [17]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  14. [22]

    Dreampose: Fashion image-to-video synthesis via stable diffusion,

    Dreampose: Fashion image-to-video synthesis via stable diffusion , author=. arXiv preprint arXiv:2304.06025 , year=

  15. [25]

    Advances in Neural Information Processing Systems , volume=

    Light field networks: Neural scene representations with single-evaluation rendering , author=. Advances in Neural Information Processing Systems , volume=

  16. [26]

    arXiv preprint arXiv:2304.13681 , year=

    Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation , author=. arXiv preprint arXiv:2304.13681 , year=

  17. [27]

    arXiv preprint arXiv:2312.04551 , year=

    Free3D: Consistent Novel View Synthesis without 3D Representation , author=. arXiv preprint arXiv:2312.04551 , year=

  18. [29]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Objaverse: A universe of annotated 3d objects , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  19. [30]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Mvimgnet: A large-scale dataset of multi-view images , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  20. [32]

    Advances in Neural Information Processing Systems , volume=

    Videocomposer: Compositional video synthesis with motion controllability , author=. Advances in Neural Information Processing Systems , volume=

  21. [34]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  22. [35]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Infinite nature: Perpetual view generation of natural scenes from a single image , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  23. [36]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Scannet: Richly-annotated 3d reconstructions of indoor scenes , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  24. [37]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Learning the depths of moving people by watching frozen people , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  25. [38]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    DreamPose: Fashion Video Synthesis with Stable Diffusion , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  26. [39]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  27. [41]

    2023 , eprint=

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation , author=. 2023 , eprint=

  28. [44]

    Advances in Neural Information Processing Systems , volume=

    Denoising diffusion probabilistic models , author=. Advances in Neural Information Processing Systems , volume=

  29. [45]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  30. [46]

    Magicvideo: Efficient video generation with latent diffusion models,

    Magicvideo: Efficient video generation with latent diffusion models , author=. arXiv preprint arXiv:2211.11018 , year=

  31. [52]

    2024 , url=

    Video generation models as world simulators , author=. 2024 , url=

  32. [54]

    arXiv preprint arXiv:2310.08465 , year=

    Motiondirector: Motion customization of text-to-video diffusion models , author=. arXiv preprint arXiv:2310.08465 , year=

  33. [60]

    IEEE International Conference on Computer Vision (ICCV) , year=

    Text2video-zero: Text-to-image diffusion models are zero-shot video generators , author=. IEEE International Conference on Computer Vision (ICCV) , year=

  34. [61]

    arXiv preprint arXiv:2305.04001 , year=

    AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion , author=. arXiv preprint arXiv:2305.04001 , year=

  35. [62]

    arXiv preprint arXiv:2304.08551 , year=

    Generative Disco: Text-to-Video Generation for Music Visualization , author=. arXiv preprint arXiv:2304.08551 , year=

  36. [63]

    Structure-from-Motion Revisited , booktitle=

    Sch\". Structure-from-Motion Revisited , booktitle=

  37. [64]

    Moviefactory: Automatic movie creation from text using large generative models for language and images,

    MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images , author=. arXiv preprint arXiv:2306.07257 , year=

  38. [65]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Conditional Image-to-Video Generation with Latent Flow Diffusion Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  39. [67]

    ToonYou , howpublished =

    BradCatt. ToonYou , howpublished =

  40. [68]

    SG 161222 civitai , title =

  41. [69]

    SO3 roration distance , howpublished =

    Boris Belousov. SO3 roration distance , howpublished =

  42. [71]

    LAVIS : A One-stop Library for Language-Vision Intelligence

    Li, Dongxu and Li, Junnan and Le, Hung and Wang, Guangsen and Savarese, Silvio and Hoi, Steven C.H. LAVIS : A One-stop Library for Language-Vision Intelligence. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2023

  43. [73]

    The Twelfth International Conference on Learning Representations , year=

    Seine: Short-to-long video diffusion model for generative transition and prediction , author=. The Twelfth International Conference on Learning Representations , year=

  44. [74]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Structure and content-guided video synthesis with diffusion models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  45. [76]

    https://openreview.net/forum?id=rylgEULtdN , year=

    FVD: A new metric for video generation , author=. https://openreview.net/forum?id=rylgEULtdN , year=

  46. [77]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  47. [79]

    Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems

  48. [80]

    Compositional 3D Scene Generation using Locally Conditioned Diffusion. arXiv preprint

  49. [81]

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  50. [83]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. 2024

  51. [84]

    Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, 2020

  52. [85]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models. In Advances in Neural Information Processing Systems

  53. [87]

    Training-free Camera Control for Video Generation

    Training-free camera control for video generation. arXiv preprint arXiv:2406.10126, 2024

  54. [93]

    Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  55. [95]

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024

  56. [96]

    Vd3d: Taming large video diffusion transformers for 3d camera control

    Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781, 2024

  57. [97]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738, 2021

  58. [98]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024

  59. [99]

    SO3 rotation distance

    Boris Belousov. SO3 rotation distance. http://www.boris-belousov.net/2016/12/01/quat-dist/

  60. [100]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a

  61. [101]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023b

  62. [102]

    BradCatt. Toonyou. https://civitai.com/models/30240/toonyou

  63. [103]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024

  64. [104]

    Videocrafter1: Open diffusion models for high-quality video generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation, 2023a

  65. [105]

    Motion-conditioned diffusion model for controllable video synthesis

    Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023b

  66. [106]

    Control-a-video: Controllable text-to-video generation with diffusion models

    Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023c

  67. [107]

    Seine: Short-to-long video diffusion model for generative transition and prediction

    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations, 2023d

  68. [108]

    Boosting camera motion control for video diffusion transformers

    Soon Yau Cheong, Duygu Ceylan, Armin Mustafa, Andrew Gilbert, and Chun-Hao Paul Huang. Boosting camera motion control for video diffusion transformers. arXiv preprint arXiv:2410.10802, 2024

  69. [109]

    Realistic vision

    SG 161222 civitai. Realistic vision. https://civitai.com/models/4201/realistic-vision-v60-b1

  70. [110]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13142–13153, 2023

  71. [111]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7346–7356, 2023

  72. [112]

    Sparsectrl: Adding sparse controls to text-to-video diffusion models

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023a

  73. [113]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023b

  74. [114]

    Photorealistic video generation with diffusion models

    Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023

  75. [115]

    Latent video diffusion models for high-fidelity video generation with arbitrary lengths

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022

  76. [116]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  77. [117]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  78. [118]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a

  79. [119]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022b

  80. [120]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022
