VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Hang Xu; Sixiao Zheng; Xiangru Huang; Yanpeng Zhou; Yanwei Fu; Yi Zhu; Zimian Peng

arxiv: 2502.07531 · v5 · pith:YMRSHNPDnew · submitted 2025-02-11 · 💻 cs.CV · cs.AI · cs.LG · cs.MM


Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu

classification 💻 cs.CV cs.AI cs.LG cs.MM

keywords motion control lighting object camera direction vidcraft3 accurate


Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. While precise control over camera motion, object motion, and lighting is essential for high-fidelity creation, existing methods often treat these factors independently. This overlooks the physical coupling among viewpoint, geometry, and illumination in dynamic scenes, leading to visual inconsistencies such as mismatched shadows and perspective drift under simultaneous changes. We present VidCRAFT3, a unified and flexible I2V framework that explicitly models cross-factor interactions among geometry, motion, and illumination, enabling both independent and joint control over camera motion, object motion, and lighting direction. Image2Cloud provides explicit 3D geometric priors for accurate camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale motion features to guide realistic object motion. A Spatial Triple-Attention Transformer integrates lighting direction through lighting cross-attention for consistent relighting. To address the scarcity of jointly annotated data, we construct the VideoLightingDirection (VLD) dataset with accurate per-frame lighting direction annotations, and introduce a three-stage progressive training strategy that enables robust learning without fully joint annotations. Extensive experiments demonstrate that VidCRAFT3 achieves state-of-the-art performance in control precision and visual coherence across diverse scenarios.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiB...

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 7.0

UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 6.0

UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

World-R1 uses Flow-GRPO reinforcement learning and a new text dataset to enforce 3D consistency in text-to-video generation while keeping the original model's visual quality.

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

World-R1 applies RL via Flow-GRPO on a new text dataset for world simulation to enforce 3D constraints in video generation while preserving visual quality.

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
cs.CV 2026-04 unverdicted novelty 4.0

World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.