VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
read the original abstract
Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. While precise control over camera motion, object motion, and lighting is essential for high-fidelity creation, existing methods often treat these factors independently. This overlooks the physical coupling among viewpoint, geometry, and illumination in dynamic scenes, leading to visual inconsistencies such as mismatched shadows and perspective drift under simultaneous changes. We present VidCRAFT3, a unified and flexible I2V framework that explicitly models cross-factor interactions among geometry, motion, and illumination, enabling both independent and joint control over camera motion, object motion, and lighting direction. Image2Cloud provides explicit 3D geometric priors for accurate camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale motion features to guide realistic object motion. A Spatial Triple-Attention Transformer integrates lighting direction through lighting cross-attention for consistent relighting. To address the scarcity of jointly annotated data, we construct the VideoLightingDirection (VLD) dataset with accurate per-frame lighting direction annotations, and introduce a three-stage progressive training strategy that enables robust learning without fully joint annotations. Extensive experiments demonstrate that VidCRAFT3 achieves state-of-the-art performance in control precision and visual coherence across diverse scenarios.
This paper has not been read by Pith yet.
Forward citations
Cited by 6 Pith papers
-
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiB...
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 uses Flow-GRPO reinforcement learning and a new text dataset to enforce 3D consistency in text-to-video generation while keeping the original model's visual quality.
-
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 applies RL via Flow-GRPO on a new text dataset for world simulation to enforce 3D constraints in video generation while preserving visual quality.
-
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.