pith. machine review for the scientific record.

arxiv: 2406.02509 · v1 · submitted 2024-06-04 · 💻 cs.CV

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Pith reviewed 2026-05-16 19:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-video generation · camera control · 3D consistency · epipolar attention · Plücker coordinates · video diffusion · structure-from-motion

The pith

CamCo adds precise camera pose control to image-to-video generation while enforcing 3D consistency across frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CamCo to let users specify exact camera movements when turning a single image into a video. It modifies a pre-trained diffusion model by supplying camera poses through Plücker coordinates and inserts an epipolar attention module that applies geometric constraints inside every attention block. Fine-tuning on real videos whose poses were recovered by structure-from-motion helps the model produce natural object motion alongside the controlled camera path. A sympathetic reader cares because most current video generators offer no reliable way to direct the camera, restricting cinematic expression and practical editing workflows.

Core claim

CamCo equips a pre-trained image-to-video diffusion model with Plücker-coordinate camera pose inputs and an epipolar attention module placed in each attention block that enforces epipolar constraints on the feature maps, then fine-tunes the resulting system on real-world videos whose poses were estimated by structure-from-motion algorithms, yielding videos that follow user-specified camera trajectories with improved 3D consistency and plausible object motion.
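The Plücker-coordinate conditioning can be made concrete. Below is a minimal numpy sketch, not the paper's implementation (the abstract fixes neither the pose convention nor the normalization), mapping each pixel to the 6-vector (d, o × d) of its viewing ray:

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plücker coordinates (d, o x d) of camera rays.

    Convention (an assumption, not taken from the paper): a world point X
    projects as K @ (R @ X + t), so the camera center is o = -R.T @ t.
    Returns an (H, W, 6) array: unit ray direction and moment per pixel.
    """
    o = -R.T @ t                                        # camera center, world frame
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # (H, W, 3) homogeneous pixels
    d = pix @ np.linalg.inv(K).T @ R                    # back-projected world-frame directions
    d /= np.linalg.norm(d, axis=-1, keepdims=True)      # normalize to unit length
    m = np.cross(np.broadcast_to(o, d.shape), d)        # moment o x d
    return np.concatenate([d, m], axis=-1)
```

Stacked per frame, this gives a pose signal at the same spatial resolution as the features; camera-control papers in this line typically concatenate it channel-wise with the video latents.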

What carries the argument

Epipolar attention module that enforces geometric constraints on feature maps, combined with Plücker coordinate parameterization of camera poses.
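One concrete reading of "enforces geometric constraints on feature maps" is a cross-frame attention mask that lets a pixel attend only near its epipolar line in the other frame. The sketch below is a hypothetical illustration, not CamCo's module: it assumes a known fundamental matrix F between the two frames and a hard pixel threshold, whereas the actual module may soft-weight rather than mask.

```python
import numpy as np

def epipolar_mask(F, H, W, thresh=2.0):
    """Boolean (H*W, H*W) attention mask between two frames.

    Entry (i, j) is True when frame-2 pixel i lies within `thresh` pixels
    of the epipolar line F @ x_j induced by frame-1 pixel j.
    """
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    x = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])   # 3 x N homogeneous pixel centers
    lines = F @ x                                          # 3 x N epipolar lines in frame 2
    # point-to-line distance |l . x| / ||(l_a, l_b)|| for every (pixel, line) pair
    dist = np.abs(x.T @ lines) / np.linalg.norm(lines[:2], axis=0)
    return dist < thresh
```

In an attention layer the mask would enter as an additive negative-infinity bias on the score matrix before the softmax, confining cross-frame attention to geometrically plausible correspondences.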

Load-bearing premise

That the epipolar attention module will enforce 3D geometric consistency without introducing new artifacts or lowering visual quality, and that fine-tuning on SfM-estimated poses from real videos will transfer to arbitrary user-specified trajectories at inference time.

What would settle it

Generate videos under complex orbiting or dollying camera paths and check whether multi-view 3D reconstruction from the output frames recovers consistent object depths and positions, or shows systematic drift.
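The check can be sketched end to end: triangulate tracked points from two generated frames using the commanded poses, reproject into a later frame, and watch the pixel error. A minimal numpy version with linear (DLT) triangulation, assuming known 3x4 projection matrices P = K [R|t] and point correspondences:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two 3x4 projection matrices."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null vector of A is the homogeneous 3D point
    X = Vt[-1]
    return X[:3] / X[3]

def reprojection_error(P, X, x):
    """Pixel distance between the projection of world point X and observation x."""
    h = P @ np.append(X, 1.0)
    return float(np.linalg.norm(h[:2] / h[2] - x))
```

Systematic growth of this error along the trajectory, rather than zero-mean noise, would be the drift signature the test above asks about.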

read the original abstract

Recently video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, limiting the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation. We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block that enforces epipolar constraints to the feature maps. Additionally, we fine-tune CamCo on real-world videos with camera poses estimated through structure-from-motion algorithms to better synthesize object motion. Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models while effectively generating plausible object motion. Project page: https://ir1d.github.io/CamCo/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CamCo, a method for adding fine-grained camera pose control to pre-trained image-to-video diffusion models. It parameterizes camera input via Plücker coordinates, inserts an epipolar attention module into each attention block to enforce geometric consistency, and fine-tunes the model on real videos whose poses were recovered by structure-from-motion (SfM). The central claim is that these changes yield videos with measurably better 3D consistency and more accurate camera control while still producing plausible object motion.

Significance. If the quantitative claims hold, CamCo would provide a practical route to controllable cinematic video synthesis from a single image, addressing a clear limitation of current diffusion-based video generators. The epipolar-attention design is a lightweight way to inject 3D inductive bias without retraining from scratch, and the use of SfM poses for fine-tuning is a pragmatic data strategy. However, the absence of any numerical results, ablation tables, or evaluation protocol in the abstract makes it impossible to judge whether the improvements are substantial enough to shift the state of the art.

major comments (3)
  1. [Abstract] The assertion that CamCo 'significantly improves 3D consistency and camera control capabilities' is unsupported by any quantitative metrics, ablation results, or description of how 3D consistency was measured (e.g., reprojection error, multi-view consistency scores, or user studies). Without these numbers the central empirical claim cannot be evaluated.
  2. [Method] Method section (camera-conditioning and fine-tuning): fine-tuning exclusively on SfM-estimated poses from real videos introduces a domain gap for arbitrary user-specified trajectories at inference. SfM poses contain noise, scale ambiguity, and are drawn from the distribution of handheld/tripod motion; no experiment tests generalization to out-of-distribution paths such as extreme dolly zooms or rapid pans outside the training support. This directly threatens the advertised 'camera control capabilities.'
  3. [Experiments] No details are given on the evaluation protocol for 3D consistency (e.g., whether it uses ground-truth poses, multi-view reconstruction error, or optical-flow consistency), nor on the baselines, datasets, or statistical significance of the reported improvements.
minor comments (2)
  1. [Abstract] The sentence 'effectively generating plausible object motion' is vague; clarify what motion quality metric or qualitative criterion is intended.
  2. [Method] Notation: Plücker coordinates are mentioned without an explicit definition or reference to the coordinate convention used; add a short equation or citation in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve the clarity of our claims, evaluation details, and discussion of limitations.

read point-by-point responses
  1. Referee: [Abstract] The assertion that CamCo 'significantly improves 3D consistency and camera control capabilities' is unsupported by any quantitative metrics, ablation results, or description of how 3D consistency was measured (e.g., reprojection error, multi-view consistency scores, or user studies). Without these numbers the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The experiments section reports specific metrics for 3D consistency (reprojection error and multi-view consistency scores) and camera control accuracy, along with comparisons to baselines. In the revised manuscript we will add a concise summary of these numerical improvements and a brief description of the evaluation metrics to the abstract. revision: yes

  2. Referee: [Method] Method section (camera-conditioning and fine-tuning): fine-tuning exclusively on SfM-estimated poses from real videos introduces a domain gap for arbitrary user-specified trajectories at inference. SfM poses contain noise, scale ambiguity, and are drawn from the distribution of handheld/tripod motion; no experiment tests generalization to out-of-distribution paths such as extreme dolly zooms or rapid pans outside the training support. This directly threatens the advertised 'camera control capabilities.'

    Authors: SfM-estimated poses from real videos are used because they provide realistic object motion that synthetic data cannot easily replicate. The epipolar attention module is intended to provide robustness to the noise and scale ambiguity inherent in SfM. While our current experiments cover a range of trajectories, we acknowledge the value of explicit OOD testing. We will add a new experiment subsection evaluating performance on extreme paths (e.g., rapid pans and dolly zooms) to better substantiate generalization claims. revision: yes

  3. Referee: [Experiments] No details are given on the evaluation protocol for 3D consistency (e.g., whether it uses ground-truth poses, multi-view reconstruction error, or optical-flow consistency), nor on the baselines, datasets, or statistical significance of the reported improvements.

    Authors: We apologize for the insufficient detail in the current draft. Section 4 specifies that 3D consistency is measured via reprojection error against SfM ground-truth poses and multi-view consistency scores; baselines are evaluated on RealEstate10K and similar datasets; results are reported as means with standard deviations over repeated samples. We will expand the experiments section with a dedicated evaluation-protocol subsection that explicitly describes these elements, the datasets, and the statistical reporting. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes an engineering extension to a pre-trained image-to-video diffusion model: injecting Plücker coordinates as camera-pose conditioning and adding an epipolar attention module, followed by fine-tuning on SfM-estimated real-video poses. These are architectural and training choices whose outputs are evaluated empirically against baselines. No equations, uniqueness theorems, or self-citations are presented that would make any claimed prediction or consistency result equivalent to its own inputs by construction. The central claims rest on comparative experiments rather than tautological reductions, so the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the pre-trained image-to-video diffusion backbone and on the validity of epipolar geometry for enforcing 3D consistency; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • [standard math] Epipolar geometry provides valid constraints between corresponding points in different views of the same 3D scene.
    Invoked to justify the epipolar attention module that enforces constraints on feature maps.
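This axiom has a direct algebraic form: corresponding pixels x1, x2 of one 3D point satisfy x2^T F x1 = 0, with the fundamental matrix F built from the relative pose. A small numpy sketch, assuming the pose convention X2 = R X1 + t purely for illustration:

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ u == np.cross(v, u)."""
    return np.array([[0., -v[2], v[1]],
                     [v[2], 0., -v[0]],
                     [-v[1], v[0], 0.]])

def fundamental(K1, K2, R, t):
    """Fundamental matrix for cameras related by X2 = R @ X1 + t."""
    E = skew(t) @ R                                  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)
```

Projecting any 3D point into both cameras and evaluating x2^T F x1 returns zero up to floating-point error, which is the constraint the epipolar attention module is said to impose on feature correspondences.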

pith-pipeline@v0.9.0 · 5488 in / 1316 out tokens · 74956 ms · 2026-05-16T19:39:57.771538+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/AlexanderDuality · alexander_duality_circle_linking · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency... we integrate an epipolar attention module... that enforces epipolar constraints to the feature maps.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

    cs.CV 2026-05 unverdicted novelty 7.0

    PicoEyes unifies gaze estimation for mixed reality by jointly predicting 3D eye parameters, segmentation, optical and visual axes, and depth maps from monocular or binocular inputs, supported by a new large-scale mult...

  2. PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

    cs.CV 2026-05 unverdicted novelty 7.0

    PicoEyes delivers a unified end-to-end model for full 3D gaze estimation including eye parameters, axes, segmentation and depth from monocular or binocular near-eye images, supported by a new large-scale multi-view dataset.

  3. Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

    cs.CV 2026-04 unverdicted novelty 7.0

    Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.

  4. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  5. Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

  6. OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

  7. SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

    cs.CV 2026-03 unverdicted novelty 7.0

    SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.

  8. StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

    cs.CV 2025-12 unverdicted novelty 7.0

    A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.

  9. RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

    cs.CV 2026-05 unverdicted novelty 6.0

    RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.

  10. PhyCo: Learning Controllable Physical Priors for Generative Motion

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ...

  11. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  12. Lyra 2.0: Explorable Generative 3D Worlds

    cs.CV 2026-04 unverdicted novelty 6.0

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  13. Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

    cs.CV 2026-04 unverdicted novelty 6.0

    A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.

  14. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  15. Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

    cs.CV 2026-01 unverdicted novelty 6.0

    A unified single-pass framework using dynamic 3D Gaussians generates temporally consistent camera-controlled videos from a single image.

  16. ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    cs.CV 2024-09 unverdicted novelty 6.0

    ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.

  17. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  18. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  19. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  20. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 18 Pith papers · 16 internal anchors

  1. [1]

    Stable video diffusion: Scaling latent video diffusion models to large datasets

    Stability AI. Stable video diffusion: Scaling latent video diffusion models to large datasets. https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets, 2023

  2. [2]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021

  3. [3]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

  4. [4]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023

  5. [5]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

  6. [6]

    Coyo-700m: Image-text pair dataset

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022

  7. [7]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. arXiv preprint arXiv:2312.12337, 2023

  8. [8]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023

  9. [9]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023

  10. [10]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023

  11. [11]

    Depth-supervised nerf: Fewer views and faster training for free

    Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022

  12. [12]

    Graphdreamer: Compositional 3d scene synthesis from scene graphs

    Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, and Bernhard Schölkopf. Graphdreamer: Compositional 3d scene synthesis from scene graphs. arXiv preprint arXiv:2312.00093, 2023

  13. [13]

    Sparsectrl: Adding sparse controls to text-to-video diffusion models

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023

  14. [14]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  15. [15]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

  16. [16]

    Moving mirrors: a high-density eeg study investigating the effect of camera movements on motor cortex activation during action observation

    Katrin Heimann, Maria Alessandra Umiltà, Michele Guerra, and Vittorio Gallese. Moving mirrors: a high-density eeg study investigating the effect of camera movements on motor cortex activation during action observation. Journal of cognitive neuroscience, 26(9):2087–2101, 2014

  17. [17]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  18. [18]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  19. [19]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  20. [20]

    Plücker coordinates for lines in the space

    Yan-Bin Jia. Plücker coordinates for lines in the space. Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 3, 2020

  21. [21]

    Spad: Spatially aware multiview diffusers

    Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, and Aliaksandr Siarohin. Spad: Spatially aware multiview diffusers. arXiv preprint arXiv:2402.05235, 2024

  22. [22]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022

  23. [23]

    Text2video-zero: Text-to-image diffusion models are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023

  24. [24]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023

  25. [25]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023

  26. [26]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint arXiv:2303.11328, 2023

  27. [27]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024

  28. [28]

    Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation

    Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation. arXiv preprint arXiv:2402.08682, 2024

  29. [29]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296–4304, 2024

  30. [30]

    Camera movement in narrative cinema: towards a taxonomy of functions

    Jakob Isak Nielsen, Edvin Kau, and Richard Raskin. Camera movement in narrative cinema: towards a taxonomy of functions. Department of Inf. & Media Studies, University of Aarhus, 2007

  31. [31]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  32. [32]

    Vase: Object-centric appearance and shape manipulation of real videos

    Elia Peruzzo, Vidit Goel, Dejia Xu, Xingqian Xu, Yifan Jiang, Zhangyang Wang, Humphrey Shi, and Nicu Sebe. Vase: Object-centric appearance and shape manipulation of real videos. arXiv preprint arXiv:2401.02473, 2024

  33. [33]

    Compositional 3d scene generation using locally conditioned diffusion

    Ryan Po and Gordon Wetzstein. Compositional 3d scene generation using locally conditioned diffusion. arXiv preprint arXiv:2303.12218, 2023

  34. [34]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  35. [35]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022

  36. [36]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  37. [37]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021

  38. [38]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  39. [39]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Ky...

  40. [40]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  41. [41]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016

  42. [42]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

  43. [43]

    Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

    Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023

  44. [44]

    Film studies: An introduction

    Ed Sikov. Film studies: An introduction. Columbia University Press, 2020

  45. [45]

    Light field networks: Neural scene representations with single-evaluation rendering

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021

  46. [46]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023

  47. [47]

    Consistent view synthesis with pose-guided diffusion models

    Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16773–16783, 2023

  48. [48]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  49. [49]

    SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

    Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. Sv3d: Novel multi-view synthesis and 3d generation from a single image using latent video diffusion. arXiv preprint arXiv:2403.12008, 2024

  50. [50]

    Taming mode collapse in score distillation for text-to-3d generation

    Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, et al. Taming mode collapse in score distillation for text-to-3d generation. arXiv preprint arXiv:2401.00909, 2024

  51. [51]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023

  52. [52]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023

  53. [53]

    NeuralLift-360: Lifting an In-the-wild 2D Photo to a 3D Object with 360° Views

    Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. NeuralLift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. arXiv preprint arXiv:2211.16431, 2022

  54. [54]

    Comp4d: Llm-guided compositional 4d scene generation

    Dejia Xu, Hanwen Liang, Neel P Bhatt, Hezhen Hu, Hanxue Liang, Konstantinos N Plataniotis, and Zhangyang Wang. Comp4d: Llm-guided compositional 4d scene generation. arXiv preprint arXiv:2403.16993, 2024

  55. [55]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  56. [56]

    An embodiment of the cinematographer: emotional and perceptual responses to different camera movement techniques

    Mehmet Burak Yilmaz, Elen Lotman, Andres Karjus, and Pia Tikka. An embodiment of the cinematographer: emotional and perceptual responses to different camera movement techniques. Frontiers in Neuroscience, 17:1160843, 2023

  57. [57]

    Magvit: Masked generative video transformer

    Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023

  58. [58]

    Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

    Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, and Anima Anandkumar. Efficient video diffusion models via content-frame motion-latent decomposition. arXiv preprint arXiv:2403.14148, 2024

  59. [59]

    Adding Conditional Control to Text-to-Image Diffusion Models

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

  60. [60]

    Scenewiz3d: Towards text-guided 3d scene composition

    Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, and Hsin-Ying Lee. Scenewiz3d: Towards text-guided 3d scene composition. arXiv preprint arXiv:2312.08885, 2023

  61. [61]

    Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild

    Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang, and Yong-Jin Liu. Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild. In European Conference on Computer Vision, pages 523–542. Springer, 2022

  62. [62]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018

  63. [63]

    VideoMV: Consistent Multi-view Generation Based on Large Video Generative Model

    Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, et al. Videomv: Consistent multi-view generation based on large video generative model. arXiv preprint arXiv:2403.12010, 2024

A Additional Details on Epipolar Constraint Attention

An epipolar line refers to the projection on one camera...
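The appendix's epipolar constraint can be illustrated numerically. The sketch below is not the paper's code; it only demonstrates the standard two-view relation the attention module relies on: for calibrated cameras P1 = K1[I | 0] and P2 = K2[R | t], the fundamental matrix F = K2^{-T} [t]_x R K1^{-1} maps a pixel x1 in view 1 to its epipolar line l2 = F x1 in view 2, and any true correspondence satisfies x2^T F x1 = 0. All intrinsics and poses here are made-up toy values.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    """F for cameras P1 = K1 [I | 0] and P2 = K2 [R | t].

    A pixel x1 (homogeneous) in view 1 maps to the epipolar line
    l2 = F @ x1 in view 2; correspondences satisfy x2.T @ F @ x1 = 0.
    """
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

# Toy check: project one 3D point into both views and evaluate the constraint.
K = np.diag([500.0, 500.0, 1.0])
K[0, 2], K[1, 2] = 320.0, 240.0    # principal point
R = np.eye(3)                      # pure sideways translation between views
t = np.array([0.1, 0.0, 0.0])
X = np.array([0.3, -0.2, 2.0])     # world point in view-1 coordinates

x1 = K @ X;           x1 /= x1[2]  # homogeneous pixel in view 1
x2 = K @ (R @ X + t); x2 /= x2[2]  # homogeneous pixel in view 2

F = fundamental_matrix(K, K, R, t)
residual = float(x2 @ F @ x1)      # epipolar constraint, ~0 up to float error
line2 = F @ x1                     # epipolar line of x1 in view 2 (a, b, c)
```

In the epipolar attention described above, such lines determine which feature-map positions in other frames a query position is allowed to attend to; this snippet only verifies the underlying geometric identity.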