pith. machine review for the scientific record.

arxiv: 2406.02509 · v1 · submitted 2024-06-04 · 💻 cs.CV

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Pith reviewed 2026-05-16 19:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-video generation · camera control · 3D consistency · epipolar attention · Plücker coordinates · video diffusion · structure-from-motion

The pith

CamCo adds precise camera pose control to image-to-video generation while enforcing 3D consistency across frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CamCo to let users specify exact camera movements when turning a single image into a video. It modifies a pre-trained diffusion model by supplying camera poses through Plücker coordinates and inserts an epipolar attention module that applies geometric constraints inside every attention block. Fine-tuning on real videos whose poses were recovered by structure-from-motion helps the model produce natural object motion alongside the controlled camera path. A sympathetic reader cares because most current video generators offer no reliable way to direct the camera, restricting cinematic expression and practical editing workflows.

Core claim

CamCo equips a pre-trained image-to-video diffusion model with Plücker-coordinate camera pose inputs and an epipolar attention module placed in each attention block that enforces epipolar constraints on the feature maps, then fine-tunes the resulting system on real-world videos whose poses were estimated by structure-from-motion algorithms, yielding videos that follow user-specified camera trajectories with improved 3D consistency and plausible object motion.
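The Plücker-coordinate conditioning can be made concrete. Below is a minimal numpy sketch, not the paper's implementation (the abstract fixes neither the pose convention nor the normalization), mapping each pixel to the 6-vector (d, o × d) of its viewing ray:

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plücker coordinates (d, o x d) of camera rays.

    Convention (an assumption, not taken from the paper): a world point X
    projects as K @ (R @ X + t), so the camera center is o = -R.T @ t.
    Returns an (H, W, 6) array: unit ray direction and moment per pixel.
    """
    o = -R.T @ t                                        # camera center, world frame
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # (H, W, 3) homogeneous pixels
    d = pix @ np.linalg.inv(K).T @ R                    # back-projected world-frame directions
    d /= np.linalg.norm(d, axis=-1, keepdims=True)      # normalize to unit length
    m = np.cross(np.broadcast_to(o, d.shape), d)        # moment o x d
    return np.concatenate([d, m], axis=-1)
```

Stacked per frame, this gives a pose signal at the same spatial resolution as the features; camera-control papers in this line typically concatenate it channel-wise with the video latents.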

What carries the argument

Epipolar attention module that enforces geometric constraints on feature maps, combined with Plücker coordinate parameterization of camera poses.
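One concrete reading of "enforces geometric constraints on feature maps" is a cross-frame attention mask that lets a pixel attend only near its epipolar line in the other frame. The sketch below is a hypothetical illustration, not CamCo's module: it assumes a known fundamental matrix F between the two frames and a hard pixel threshold, whereas the actual module may soft-weight rather than mask.

```python
import numpy as np

def epipolar_mask(F, H, W, thresh=2.0):
    """Boolean (H*W, H*W) attention mask between two frames.

    Entry (i, j) is True when frame-2 pixel i lies within `thresh` pixels
    of the epipolar line F @ x_j induced by frame-1 pixel j.
    """
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    x = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])   # 3 x N homogeneous pixel centers
    lines = F @ x                                          # 3 x N epipolar lines in frame 2
    # point-to-line distance |l . x| / ||(l_a, l_b)|| for every (pixel, line) pair
    dist = np.abs(x.T @ lines) / np.linalg.norm(lines[:2], axis=0)
    return dist < thresh
```

In an attention layer the mask would enter as an additive negative-infinity bias on the score matrix before the softmax, confining cross-frame attention to geometrically plausible correspondences.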

Load-bearing premise

That the epipolar attention module will enforce 3D geometric consistency without introducing new artifacts or lowering visual quality, and that fine-tuning on SfM-estimated poses from real videos will transfer to arbitrary user-specified trajectories at inference time.

What would settle it

Generate videos under complex orbiting or dollying camera paths and check whether multi-view 3D reconstruction from the output frames recovers consistent object depths and positions, or shows systematic drift.
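The check can be sketched end to end: triangulate tracked points from two generated frames using the commanded poses, reproject into a later frame, and watch the pixel error. A minimal numpy version with linear (DLT) triangulation, assuming known 3x4 projection matrices P = K [R|t] and point correspondences:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two 3x4 projection matrices."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null vector of A is the homogeneous 3D point
    X = Vt[-1]
    return X[:3] / X[3]

def reprojection_error(P, X, x):
    """Pixel distance between the projection of world point X and observation x."""
    h = P @ np.append(X, 1.0)
    return float(np.linalg.norm(h[:2] / h[2] - x))
```

Systematic growth of this error along the trajectory, rather than zero-mean noise, would be the drift signature the test above asks about.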

read the original abstract

Recently video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, limiting the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation. We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block that enforces epipolar constraints to the feature maps. Additionally, we fine-tune CamCo on real-world videos with camera poses estimated through structure-from-motion algorithms to better synthesize object motion. Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models while effectively generating plausible object motion. Project page: https://ir1d.github.io/CamCo/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CamCo, a method for adding fine-grained camera pose control to pre-trained image-to-video diffusion models. It parameterizes camera input via Plücker coordinates, inserts an epipolar attention module into each attention block to enforce geometric consistency, and fine-tunes the model on real videos whose poses were recovered by structure-from-motion (SfM). The central claim is that these changes yield videos with measurably better 3D consistency and more accurate camera control while still producing plausible object motion.

Significance. If the quantitative claims hold, CamCo would provide a practical route to controllable cinematic video synthesis from a single image, addressing a clear limitation of current diffusion-based video generators. The epipolar-attention design is a lightweight way to inject 3D inductive bias without retraining from scratch, and the use of SfM poses for fine-tuning is a pragmatic data strategy. However, the absence of any numerical results, ablation tables, or evaluation protocol in the abstract makes it impossible to judge whether the improvements are substantial enough to shift the state of the art.

major comments (3)
  1. [Abstract] The assertion that CamCo 'significantly improves 3D consistency and camera control capabilities' is unsupported by any quantitative metrics, ablation results, or description of how 3D consistency was measured (e.g., reprojection error, multi-view consistency scores, or user studies). Without these numbers the central empirical claim cannot be evaluated.
  2. [Method] Method section (camera-conditioning and fine-tuning): fine-tuning exclusively on SfM-estimated poses from real videos introduces a domain gap for arbitrary user-specified trajectories at inference. SfM poses contain noise, scale ambiguity, and are drawn from the distribution of handheld/tripod motion; no experiment tests generalization to out-of-distribution paths such as extreme dolly zooms or rapid pans outside the training support. This directly threatens the advertised 'camera control capabilities.'
  3. [Experiments] No details are given on the evaluation protocol for 3D consistency (e.g., whether it uses ground-truth poses, multi-view reconstruction error, or optical-flow consistency), nor on the baselines, datasets, or statistical significance of the reported improvements.
minor comments (2)
  1. [Abstract] The sentence 'effectively generating plausible object motion' is vague; clarify what motion quality metric or qualitative criterion is intended.
  2. [Method] Notation: Plücker coordinates are mentioned without an explicit definition or reference to the coordinate convention used; add a short equation or citation in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve the clarity of our claims, evaluation details, and discussion of limitations.

read point-by-point responses
  1. Referee: [Abstract] The assertion that CamCo 'significantly improves 3D consistency and camera control capabilities' is unsupported by any quantitative metrics, ablation results, or description of how 3D consistency was measured (e.g., reprojection error, multi-view consistency scores, or user studies). Without these numbers the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The experiments section reports specific metrics for 3D consistency (reprojection error and multi-view consistency scores) and camera control accuracy, along with comparisons to baselines. In the revised manuscript we will add a concise summary of these numerical improvements and a brief description of the evaluation metrics to the abstract. revision: yes

  2. Referee: [Method] Method section (camera-conditioning and fine-tuning): fine-tuning exclusively on SfM-estimated poses from real videos introduces a domain gap for arbitrary user-specified trajectories at inference. SfM poses contain noise, scale ambiguity, and are drawn from the distribution of handheld/tripod motion; no experiment tests generalization to out-of-distribution paths such as extreme dolly zooms or rapid pans outside the training support. This directly threatens the advertised 'camera control capabilities.'

    Authors: SfM-estimated poses from real videos are used because they provide realistic object motion that synthetic data cannot easily replicate. The epipolar attention module is intended to provide robustness to the noise and scale ambiguity inherent in SfM. While our current experiments cover a range of trajectories, we acknowledge the value of explicit OOD testing. We will add a new experiment subsection evaluating performance on extreme paths (e.g., rapid pans and dolly zooms) to better substantiate generalization claims. revision: yes

  3. Referee: [Experiments] No details are given on the evaluation protocol for 3D consistency (e.g., whether it uses ground-truth poses, multi-view reconstruction error, or optical-flow consistency), nor on the baselines, datasets, or statistical significance of the reported improvements.

    Authors: We apologize for the insufficient detail in the current draft. Section 4 specifies that 3D consistency is measured via reprojection error against SfM ground-truth poses and multi-view consistency scores; baselines are evaluated on RealEstate10K and similar datasets; results are reported as means with standard deviations over repeated samples. We will expand the experiments section with a dedicated evaluation-protocol subsection that explicitly describes these elements, the datasets, and the statistical reporting. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes an engineering extension to a pre-trained image-to-video diffusion model: injecting Plücker coordinates as camera-pose conditioning and adding an epipolar attention module, followed by fine-tuning on SfM-estimated real-video poses. These are architectural and training choices whose outputs are evaluated empirically against baselines. No equations, uniqueness theorems, or self-citations are presented that would make any claimed prediction or consistency result equivalent to its own inputs by construction. The central claims rest on comparative experiments rather than tautological reductions, so the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the pre-trained image-to-video diffusion backbone and on the validity of epipolar geometry for enforcing 3D consistency; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • [standard math] Epipolar geometry provides valid constraints between corresponding points in different views of the same 3D scene.
    Invoked to justify the epipolar attention module that enforces constraints on feature maps.
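This axiom has a direct algebraic form: corresponding pixels x1, x2 of one 3D point satisfy x2^T F x1 = 0, with the fundamental matrix F built from the relative pose. A small numpy sketch, assuming the pose convention X2 = R X1 + t purely for illustration:

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ u == np.cross(v, u)."""
    return np.array([[0., -v[2], v[1]],
                     [v[2], 0., -v[0]],
                     [-v[1], v[0], 0.]])

def fundamental(K1, K2, R, t):
    """Fundamental matrix for cameras related by X2 = R @ X1 + t."""
    E = skew(t) @ R                                  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)
```

Projecting any 3D point into both cameras and evaluating x2^T F x1 returns zero up to floating-point error, which is the constraint the epipolar attention module is said to impose on feature correspondences.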

pith-pipeline@v0.9.0 · 5488 in / 1316 out tokens · 74956 ms · 2026-05-16T19:39:57.771538+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/AlexanderDuality · alexander_duality_circle_linking · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency... we integrate an epipolar attention module... that enforces epipolar constraints to the feature maps.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

    cs.CV 2026-05 unverdicted novelty 7.0

    PicoEyes unifies gaze estimation for mixed reality by jointly predicting 3D eye parameters, segmentation, optical and visual axes, and depth maps from monocular or binocular inputs, supported by a new large-scale mult...

  2. PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

    cs.CV 2026-05 unverdicted novelty 7.0

    PicoEyes delivers a unified end-to-end model for full 3D gaze estimation including eye parameters, axes, segmentation and depth from monocular or binocular near-eye images, supported by a new large-scale multi-view dataset.

  3. Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

    cs.CV 2026-04 unverdicted novelty 7.0

    Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.

  4. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  5. Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

  6. OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

  7. SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

    cs.CV 2026-03 unverdicted novelty 7.0

    SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.

  8. StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

    cs.CV 2025-12 unverdicted novelty 7.0

    A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.

  9. RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

    cs.CV 2026-05 unverdicted novelty 6.0

    RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.

  10. PhyCo: Learning Controllable Physical Priors for Generative Motion

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ...

  11. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  12. Lyra 2.0: Explorable Generative 3D Worlds

    cs.CV 2026-04 unverdicted novelty 6.0

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  13. Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

    cs.CV 2026-04 unverdicted novelty 6.0

    A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.

  14. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  15. Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

    cs.CV 2026-01 unverdicted novelty 6.0

    A unified single-pass framework using dynamic 3D Gaussians generates temporally consistent camera-controlled videos from a single image.

  16. ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    cs.CV 2024-09 unverdicted novelty 6.0

    ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.

  17. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  18. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  19. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  20. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 18 Pith papers · 16 internal anchors

  1. [1]

    Stable video diffusion: Scaling latent video diffusion models to large datasets

    Stability AI. Stable video diffusion: Scaling latent video diffusion models to large datasets. https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets, 2023

  2. [2]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021

  3. [3]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

  4. [4]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023

  5. [5]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

  6. [6]

    Coyo-700m: Image-text pair dataset

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022

  7. [7]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. arXiv preprint arXiv:2312.12337, 2023

  8. [8]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023

  9. [9]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023

  10. [10]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023

  11. [11]

    Depth-supervised nerf: Fewer views and faster training for free

    Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022

  12. [12]

    Graphdreamer: Compositional 3d scene synthesis from scene graphs

    Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, and Bernhard Schölkopf. Graphdreamer: Compositional 3d scene synthesis from scene graphs. arXiv preprint arXiv:2312.00093, 2023

  13. [13]

    Sparsectrl: Adding sparse controls to text-to-video diffusion models

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023

  14. [14]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  15. [15]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

  16. [16]

    Moving mirrors: a high-density eeg study investigating the effect of camera movements on motor cortex activation during action observation

    Katrin Heimann, Maria Alessandra Umiltà, Michele Guerra, and Vittorio Gallese. Moving mirrors: a high-density eeg study investigating the effect of camera movements on motor cortex activation during action observation. Journal of cognitive neuroscience, 26(9):2087–2101, 2014

  17. [17]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  18. [18]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  19. [19]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  20. [20]

    Plücker coordinates for lines in the space

    Yan-Bin Jia. Plücker coordinates for lines in the space. Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 3, 2020

  21. [21]

    Spad: Spatially aware multiview diffusers

    Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, and Aliaksandr Siarohin. Spad: Spatially aware multiview diffusers. arXiv preprint arXiv:2402.05235, 2024

  22. [22]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022

  23. [23]

    Text2video-zero: Text-to-image diffusion models are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023

  24. [24]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023

  25. [25]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023

  26. [26]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint arXiv:2303.11328, 2023

  27. [27]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024

  28. [28]

    Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation

    Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation. arXiv preprint arXiv:2402.08682, 2024

  29. [29]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296–4304, 2024

  30. [30]

    Camera movement in narrative cinema: towards a taxonomy of functions

    Jakob Isak Nielsen, Edvin Kau, and Richard Raskin. Camera movement in narrative cinema: towards a taxonomy of functions. Department of Inf. & Media Studies, University of Aarhus, 2007

  31. [31]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  32. [32]

    Vase: Object-centric appearance and shape manipulation of real videos

    Elia Peruzzo, Vidit Goel, Dejia Xu, Xingqian Xu, Yifan Jiang, Zhangyang Wang, Humphrey Shi, and Nicu Sebe. Vase: Object-centric appearance and shape manipulation of real videos. arXiv preprint arXiv:2401.02473, 2024

  33. [33]

    Compositional 3d scene generation using locally conditioned diffusion

    Ryan Po and Gordon Wetzstein. Compositional 3d scene generation using locally conditioned diffusion. arXiv preprint arXiv:2303.12218, 2023

  34. [34]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  35. [35]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022

  36. [36]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  37. [37]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021

  38. [38]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  39. [39]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Ky...

  40. [40]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  41. [41]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016

  42. [42]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

  43. [43]

    Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

    Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023

  44. [44]

    Film studies: An introduction

    Ed Sikov. Film studies: An introduction. Columbia University Press, 2020

  45. [45]

    Light field networks: Neural scene representations with single-evaluation rendering

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021

  46. [46]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023

  47. [47]

    Consistent view synthesis with pose-guided diffusion models

    Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16773–16783, 2023

  48. [48]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  49. [49]

    SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

    Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. Sv3d: Novel multi-view synthesis and 3d generation from a single image using latent video diffusion. arXiv preprint arXiv:2403.12008, 2024

  50. [50]

    Taming mode collapse in score distillation for text-to-3d generation

    Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, et al. Taming mode collapse in score distillation for text-to-3d generation. arXiv preprint arXiv:2401.00909, 2024

  51. [51]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023

  52. [52]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023

  53. [53]

    NeuralLift-360: Lifting an In-the-wild 2D Photo to a 3D Object with 360° Views

    Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. NeuralLift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. arXiv preprint arXiv:2211.16431, 2022

  54. [54]

    Comp4d: Llm-guided compositional 4d scene generation

    Dejia Xu, Hanwen Liang, Neel P Bhatt, Hezhen Hu, Hanxue Liang, Konstantinos N Plataniotis, and Zhangyang Wang. Comp4d: Llm-guided compositional 4d scene generation. arXiv preprint arXiv:2403.16993, 2024

  55. [55]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  56. [56]

    An embodiment of the cinematographer: emotional and perceptual responses to different camera movement techniques

    Mehmet Burak Yilmaz, Elen Lotman, Andres Karjus, and Pia Tikka. An embodiment of the cinematographer: emotional and perceptual responses to different camera movement techniques. Frontiers in Neuroscience, 17:1160843, 2023

  57. [57]

    Magvit: Masked generative video transformer

    Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023

  58. [58]

    Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

    Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, and Anima Anandkumar. Efficient video diffusion models via content-frame motion-latent decomposition. arXiv preprint arXiv:2403.14148, 2024

  59. [59]

    Adding Conditional Control to Text-to-Image Diffusion Models

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

  60. [60]

    Scenewiz3d: Towards text-guided 3d scene composition

    Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, and Hsin-Ying Lee. Scenewiz3d: Towards text-guided 3d scene composition. arXiv preprint arXiv:2312.08885, 2023

  61. [61]

    Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild

    Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang, and Yong-Jin Liu. Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild. In European Conference on Computer Vision, pages 523–542. Springer, 2022

  62. [62]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018

  63. [63]

    VideoMV: Consistent Multi-view Generation Based on Large Video Generative Model

    Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, et al. Videomv: Consistent multi-view generation based on large video generative model. arXiv preprint arXiv:2403.12010, 2024

A Additional Details on Epipolar Constraint Attention

An epipolar line refers to the projection on one camera...
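The appendix's epipolar constraint can be illustrated numerically. The sketch below is not the paper's code; it only demonstrates the standard two-view relation the attention module relies on: for calibrated cameras P1 = K1[I | 0] and P2 = K2[R | t], the fundamental matrix F = K2^{-T} [t]_x R K1^{-1} maps a pixel x1 in view 1 to its epipolar line l2 = F x1 in view 2, and any true correspondence satisfies x2^T F x1 = 0. All intrinsics and poses here are made-up toy values.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    """F for cameras P1 = K1 [I | 0] and P2 = K2 [R | t].

    A pixel x1 (homogeneous) in view 1 maps to the epipolar line
    l2 = F @ x1 in view 2; correspondences satisfy x2.T @ F @ x1 = 0.
    """
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

# Toy check: project one 3D point into both views and evaluate the constraint.
K = np.diag([500.0, 500.0, 1.0])
K[0, 2], K[1, 2] = 320.0, 240.0    # principal point
R = np.eye(3)                      # pure sideways translation between views
t = np.array([0.1, 0.0, 0.0])
X = np.array([0.3, -0.2, 2.0])     # world point in view-1 coordinates

x1 = K @ X;           x1 /= x1[2]  # homogeneous pixel in view 1
x2 = K @ (R @ X + t); x2 /= x2[2]  # homogeneous pixel in view 2

F = fundamental_matrix(K, K, R, t)
residual = float(x2 @ F @ x1)      # epipolar constraint, ~0 up to float error
line2 = F @ x1                     # epipolar line of x1 in view 2 (a, b, c)
```

In the epipolar attention described above, such lines determine which feature-map positions in other frames a query position is allowed to attend to; this snippet only verifies the underlying geometric identity.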