
InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

Recent advances in diffusion models have enabled impressive video editing capabilities, yet production-grade Video Object Insertion (VOI) remains challenging due to inadequate 4D scene understanding and a lack of proper optical interactions, such as shadows and reflections. To address these limitations, we present InsertAnywhere, a comprehensive VOI framework that achieves geometrically grounded object placement and optics-aware video synthesis. Our approach first leverages a 4D-aware mask generation module that allows users to anchor an object's 3D pose in a single frame. The framework automatically propagates this placement across the video, accurately handling local scene dynamics and occlusions. To synthesize realistic physical lighting interactions, we introduce Optics-Aware Representation Alignment, a novel strategy that utilizes an extended mask to guide feature extraction, enabling optical effects to seamlessly extend beyond the inserted object's boundary. Finally, to overcome the lack of training data for such phenomena, we construct and open-source ROSE++, a specialized quadruplet dataset tailored for the supervised learning of optical effects. Extensive experiments demonstrate that InsertAnywhere produces geometrically plausible and photometrically realistic insertions in complex real-world scenarios, significantly outperforming existing research and commercial generative tools.

citation-role summary

background 1 · method 1


fields

cs.CV 3 · cs.GR 1

years

2026 4

verdicts

UNVERDICTED 4


representative citing papers

AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance

cs.GR · 2026-05-31 · unverdicted · novelty 6.0

AlbedoEdit fine-tunes video foundation models to translate RGB videos into edited versions conditioned on user-edited first-frame albedo maps, trained on a new synthetic paired dataset for insertion, removal, and texture tasks.

Controllable Video Object Insertion via Multiview Priors

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

A multi-view prior-based framework for video object insertion that uses dual-path conditioning and an integration-aware consistency module to improve appearance stability and occlusion handling.
