pith. sign in

arxiv: 2606.20556 · v1 · pith:XXRRYDNQnew · submitted 2026-06-18 · 💻 cs.CV

Thinking in Boxes: 3D Editing in Real Images Made Easy

Pith reviewed 2026-06-26 18:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D image editingbox conditioningreal photograph manipulationgeometric transformationsviewpoint changeobject identity preservationdepth reference plane
0
0 comments X

The pith

User-specified 3D input and output boxes turn real-image editing into a precise geometry task that handles large rotations, scales, and viewpoint shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that text or 2D prompts give ambiguous control over big spatial changes in photos, while 3D boxes supplied as exact before-and-after specifications remove that ambiguity. The method uses color-coded box faces to define translation, rotation, scaling, and camera motion, then conditions a generator on a depth-aligned floor plane so the output stays consistent with scene appearance and fills in previously hidden surfaces. Training proceeds in two stages on synthetic multi-object scenes followed by a small collection of real Objectron videos, after which the system applies to complex in-the-wild photographs without further fine-tuning. If correct, this yields edits that preserve object identity far better than prior approaches and directly operate on ordinary photographs rather than requiring 3D models or multiple views.

Core claim

By treating 3D boxes as structured specifications rather than loose location hints, the approach casts editing as a well-posed geometry problem whose solution is an image generator conditioned on both the boxes and a depth-aligned planar floor reference; the resulting system, trained only on synthetic data plus limited real video, produces identity-preserving results under large transformations on real photographs and outperforms recent state-of-the-art methods.

What carries the argument

The thinking-in-boxes interface in which a user draws an input 3D box on the source image and an output 3D box that encodes the desired transformation, together with a depth-aligned planar floor that supplies a global reference frame.

If this is right

  • Precise 3D transformations become possible without the ambiguity of text or 2D conditioning.
  • Unseen object regions are recovered consistently with the supplied geometry.
  • The same generator works on ordinary real photographs after training on synthetic data and limited real video.
  • Performance exceeds recent state-of-the-art methods specifically on large-scale 3D edits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same box specification could be applied frame-by-frame to produce 3D-consistent video edits.
  • Pairing the method with single-image depth estimators might remove the need for any real video during training.
  • The floor-plane reference suggests the approach could extend to edits that also move the camera around non-planar scenes if additional reference surfaces are supplied.

Load-bearing premise

Training on synthetic multi-object scenes plus a small set of real videos is sufficient for the generator to generalize to complex in-the-wild photographs while preserving identity under large transformations.

What would settle it

A test set of real photographs containing extreme object rotations or viewpoint changes where the method either distorts object identity or produces implausible geometry in newly visible regions would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2606.20556 by Anand Bhattad, D.A. Forsyth, Naveen Chandra R, Pradhaan S Bhat, Rishubh Parihar, R. Venkatesh Babu, Vaibhav Vavilala.

Figure 1
Figure 1. Figure 1: Thinking in boxes. Given a single source image (bottom left), the user fits 3D boxes around objects of interest, anchored on a depth-aligned floor (top left). Editing the boxes drives the corresponding edit in the image: from left to right, translation, rotation, combined translation and rotation, and a camera viewpoint change. The bottles’ and helmets’ appearance is preserved across all four edits, includ… view at source ↗
Figure 2
Figure 2. Figure 2: Editing Pipeline. (1) Users provide a real source image and (2) fit 3D boxes to the objects within the scene using a point-and-click interface. (3) The boxes can be manipulated in 3D space, allowing for scaling, rotation, translation, and camera moves. Both the source and target box layouts are projected into 2D and serve, alongside the source image, as inputs to an image editing model [34]. (4) The model … view at source ↗
Figure 3
Figure 3. Figure 3: Training pipeline. A training exam￾ple provides a source image, a noisy target im￾age, and the source and target box projections that specify the transformation T between them. All four are encoded by the VAE and concatenated spatially. Joint attention inside FLUX-Kontext ap￾plies T to the image latents – mapping the source latent to the target latent under box-pair condition￾ing. LoRA layers in the attent… view at source ↗
Figure 4
Figure 4. Figure 4: Dataset rendering pipeline. (1) 3D assets are drawn from a subset of Objaverse-XL [15]; HDRIs and floor textures are taken from 3D-Fixer [71]. (2) Each scene places [N] scale-normalized objects on a textured floor under a sampled HDRI, with collision-aware placement. We then perturb the objects in rotation [0, 2π], scale [0.5, 1.5], and position to produce a second scene configuration. (3) Blender Cycles r… view at source ↗
Figure 5
Figure 5. Figure 5: In-the-wild results. Nine real-image edits across diverse indoor and outdoor scenes (offices, dining rooms, street scenes, desert, living rooms). Each panel shows the source image with its fitted boxes (top-left inset), the user’s edited boxes (top-right inset), and the resulting generation. Edits include translation, rotation, scaling, and camera viewpoint changes; the same model handles all four operatio… view at source ↗
Figure 6
Figure 6. Figure 6: User Study. Each bar represents user preference rate for our method compared to base￾lines, categorized by specific evaluation criteria. We conducted an A/B user study with 49 partic￾ipants, each asked to select the preferred out￾put from randomly sampled generations pro￾duced by our method and three baselines: Free￾Fine [79], SAM3D [11], and SpatialEdit [66]. We evaluate a) object preservation, b) back￾gr… view at source ↗
Figure 7
Figure 7. Figure 7: Scene editing across operations, scenes, and rendering styles. Each row shows a source image (column 1) edited under four operations: translation, scaling, rotation, and a large combined transformation (columns 2–5). The same trained model handles all operations across all scenes – including the line drawing in row 1, despite training only on photographic data. The model preserves rendering style under the… view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison on real images. For each example, the user specifies an edit by editing 3D boxes (columns 2–3), which we render to 2D as the target conditioning input for our method and convert to the appropriate input format for each baseline (mask, depth, control points). The examples shown are deliberately hard: every subject is a non-rigid animal (zebras, impala, polar bear, chicken), and the ed… view at source ↗
Figure 9
Figure 9. Figure 9: GBW’s reported fail￾ure cases. This is an example GBW [60] flags as difficult for their method. Our box-space con￾ditioning handles it, while their depth-and-warp convex primitive pipeline does not. Source (left), GBW (middle), ours (right). We report quantitative metrics against both 3D editing and im￾age editing baselines. Results are shown in Tab. 1 and Tab. 2. Our method significantly outperforms prior… view at source ↗
Figure 10
Figure 10. Figure 10: Limitation: Our model fails when multiple objects could plausibly satisfy the target layout. In both examples, either object could be moved to match the target boxes (small objects, top row; large objects, bottom row), and the model refuses to commit to one – producing an identity transformation instead. Acknowledgement. This work is supported by PMRF by Govt. of India. References [1] Vaibhav Agrawal, Ris… view at source ↗
Figure 11
Figure 11. Figure 11: In-the-wild 3D editing: We demonstrate the robustness of our method by showcasing additional 3D edits across real world scenes having different lighting, viewpoints and styles. Our method although trained on one or two object scenes, generalizes to complex multi object layouts in real images. Notably we demonstrate all 3D edit operations in this figure from object-centric transla￾tion, scaling and rotatio… view at source ↗
Figure 12
Figure 12. Figure 12: Smooth Interpolation: We demonstrate smooth interpolation between the source image and target layout. We overlay the source object position in terms of 2D bounding box on each image to clearly distinguish the incremental layout adjustments and this suggests that the model has internally learnt how to adjusts objects smoothly through box transformations. Additionally, we compare our model’s results with le… view at source ↗
Figure 14
Figure 14. Figure 14: Comparison with SOTA Models. We evaluate our method against leading proprietary image generative models, Gemini and ChatGPT, on 3D editing tasks within real-world scenes. Since these models primarily rely on text-based conditioning, which lacks the spatial granularity necessary for 3D manipulation, they often fail to execute the precise 3D edits required by users. A.2 Quantitative results In the camera ed… view at source ↗
Figure 13
Figure 13. Figure 13: Rethinking a scene: This figure illustrates our model’s ability to manipulate 3D scene layouts using primitive box representations. We demonstrate precise control over object location, orientation, and scale, enabling complex 3D transformations such as extreme viewing angles that require the seamless reorganization of multiple scene elements. Ultimately, our method serves as a robust tool for performing d… view at source ↗
Figure 15
Figure 15. Figure 15: Statistics of training dataset. (a) The distribution of the minimum visibility ratio across objects within each scene is left-skewed and highly concentrated near unity (median=0.84). This profile indicates that while the majority of scenes exhibit only mild inter-object occlusion, a substantial long tail extending to 0.01 ensures the inclusion of severe occlusion cases. (b) Object orientations follow a ne… view at source ↗
Figure 16
Figure 16. Figure 16: Object category distribution across the 10,143 downloaded assets, derived via multi-label [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Sample illustrations from our dataset’s environmental and material assets. [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Various ObjaverseXL assets rendered in Blender, which are used in training are visualized [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Ablation visualization. Adding the checkered floor pattern for ground localization and orientation-specific face colors for the conditioning box yields generations that are substantially more faithful to the edit instructions. Without the floor (No Floor), the model fails to preserve the texture and ground contact of the animals, as seen most clearly in the first row. With a uniformly colored box (Single … view at source ↗
Figure 20
Figure 20. Figure 20: Ablation Dataset visualization. Here we show a sample rendering of the assets with the same box colors and without floor. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Participants are shown the input image, source and target 3D layouts, and two candidate [PITH_FULL_IMAGE:figures/full_fig_p023_21.png] view at source ↗
read the original abstract

Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing -- particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This ``thinking in boxes'' interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages -- on synthetic multi-object scenes and a small set of real-world videos from Objectron -- the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes using 3D boxes as precise, structured specifications (with color-coded faces indicating orientation) for 3D image editing in real photographs, rather than loose conditioning. Users specify input/output boxes to control translation, rotation, scaling, and viewpoint changes. A depth-aligned planar floor provides a global reference frame with depth-aware shading. An image generator is trained in two stages—first on synthetic multi-object scenes, then on a small set of Objectron real-world videos—and is claimed to generalize to complex in-the-wild images while preserving identity and recovering unseen regions. The central claim is that the method operates directly on real photos and substantially outperforms recent SOTA methods on large 3D edits.

Significance. If the generalization from the described training regime and the outperformance claims hold, the work would provide a practical, geometry-grounded interface for controllable 3D edits that addresses weaknesses in 2D text or conditioning methods. The structured use of boxes and the floor reference frame are concrete ideas that could influence future editing tools.

major comments (2)
  1. [Abstract] Abstract: the claim that the method 'substantially outperforms recent state-of-the-art methods on large 3D edits' is unsupported by any quantitative results, baselines, metrics, error bars, or dataset details. This absence makes the central empirical claim unevaluable from the manuscript text.
  2. [Training procedure] Training description (two-stage procedure on synthetic scenes followed by a small Objectron set): no evidence or analysis is supplied showing that this limited data distribution suffices to guarantee identity preservation and hallucination of unseen regions under large transformations on arbitrary in-the-wild images; the generalization step is load-bearing for the main claim but lacks supporting experiments or regularization arguments.
minor comments (1)
  1. The color-coding scheme for box faces and the depth-aware shading on the floor plane would benefit from an explicit figure or diagram early in the paper to clarify the interface.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the thorough review of our manuscript. We value the referee's emphasis on empirical rigor and address the major comments point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the method 'substantially outperforms recent state-of-the-art methods on large 3D edits' is unsupported by any quantitative results, baselines, metrics, error bars, or dataset details. This absence makes the central empirical claim unevaluable from the manuscript text.

    Authors: While the abstract does not include numerical results, the full manuscript presents extensive qualitative evaluations in Section 4, comparing our approach to recent methods on large 3D edits with visual evidence of better identity preservation and region hallucination. Quantitative metrics are not standard for this task as they often fail to reflect perceptual quality in 3D transformations; we instead provide user studies or visual comparisons. We will revise the abstract to clarify that the outperformance is based on qualitative assessments to avoid any ambiguity. revision: partial

  2. Referee: [Training procedure] Training description (two-stage procedure on synthetic scenes followed by a small Objectron set): no evidence or analysis is supplied showing that this limited data distribution suffices to guarantee identity preservation and hallucination of unseen regions under large transformations on arbitrary in-the-wild images; the generalization step is load-bearing for the main claim but lacks supporting experiments or regularization arguments.

    Authors: The paper includes ablation studies and results on out-of-distribution in-the-wild images to demonstrate generalization. The two-stage training allows the model to learn precise 3D transformations from synthetic data and adapt to real lighting and textures from Objectron. The depth-aware floor and box conditioning provide strong inductive biases that aid generalization without requiring large real datasets. We will expand the discussion in the revised version to include more analysis on why this training suffices and any limitations. revision: partial

Circularity Check

0 steps flagged

No circularity: method relies on external training data and empirical generalization claims

full rationale

The paper describes a two-stage training pipeline on synthetic multi-object scenes followed by a small set of Objectron videos, then claims generalization to in-the-wild images. No equations, fitted parameters, or self-referential definitions appear in the provided text. The central claim is an empirical assertion about out-of-distribution performance rather than a derivation that reduces to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. This is a standard non-finding for a methods paper whose load-bearing elements are data-driven rather than algebraic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, mathematical axioms, or newly postulated entities; the depth-aligned planar floor is introduced as a modeling choice rather than a derived entity.

pith-pipeline@v0.9.1-grok · 5776 in / 1098 out tokens · 29445 ms · 2026-06-26T18:25:52.293610+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 13 linked inside Pith

  1. [1]

    Seethrough3d: Occlusion aware 3d control in text-to-image generation

    Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, and R Venkatesh Babu. Seethrough3d: Occlusion aware 3d control in text-to-image generation. In CVPR, 2026

  2. [2]

    Objectron: A large scale dataset of object-centric videos in the wild with pose annotations

    Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7822–7831, 2021

  3. [3]

    Magic fixup: Streamlining photo editing by watching dynamic videos.ACM Transactions on Graphics, 44(5):1–25, 2025

    Hadi Alzayer, Zhihao Xia, Xuaner Zhang, Eli Shechtman, Jia-Bin Huang, and Michael Gharbi. Magic fixup: Streamlining photo editing by watching dynamic videos.ACM Transactions on Graphics, 44(5):1–25, 2025

  4. [4]

    Stable flow: Vital layers for training-free image editing

    Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, and Daniel Cohen-Or. Stable flow: Vital layers for training-free image editing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7877–7888, 2025

  5. [5]

    Loosecontrol: Lifting controlnet for generalized depth conditioning

    Shariq Farooq Bhat, Niloy Mitra, and Peter Wonka. Loosecontrol: Lifting controlnet for generalized depth conditioning. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  6. [6]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  7. [7]

    Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22560–22570, 2023

  8. [8]

    Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

  9. [9]

    Blenderfusion: 3d-grounded visual editing and generative compositing.arXiv preprint arXiv:2506.17450, 2025

    Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, and Sanghyun Woo. Blenderfusion: 3d-grounded visual editing and generative compositing.arXiv preprint arXiv:2506.17450, 2025

  10. [10]

    Anydoor: Zero-shot object-level image customization

    Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6593–6602, 2024. 10

  11. [11]

    Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

    Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

  12. [12]

    Learning continuous 3d words for text-to-image generation

    Ta-Ying Cheng, Matheus Gadelha, Thibault Groueix, Matthew Fisher, Radomir Mech, Andrew Markham, and Niki Trigoni. Learning continuous 3d words for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6753–6762, 2024

  13. [13]

    Marble: Material recomposi- tion and blending in clip-space

    Ta Ying Cheng, Prafull Sharma, Mark Boss, and Varun Jampani. Marble: Material recomposi- tion and blending in clip-space. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13061–13071, 2025

  14. [14]

    3d-fixup: Advancing photo editing with 3d priors

    Yen-Chi Cheng, Krishna Kumar Singh, Jae Shin Yoon, Alexander Schwing, Liang-Yan Gui, Matheus Gadelha, Paul Guerrero, and Nanxuan Zhao. 3d-fixup: Advancing photo editing with 3d priors. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–10, 2025

  15. [15]

    Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Information Processing Systems, 36:35799–35813, 2023

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Information Processing Systems, 36:35799–35813, 2023

  16. [16]

    Strobl, Matthias Humt, and Rudolph Triebel

    Maximilian Denninger, Dominik Winkelbauer, Martin Sundermeyer, Wout Boerdijk, Markus Knauer, Klaus H. Strobl, Matthias Humt, and Rudolph Triebel. Blenderproc2: A procedural pipeline for photorealistic rendering.Journal of Open Source Software, 8(82):4901, 2023

  17. [17]

    Build-a-scene: Interactive 3d layout control for diffusion-based image generation.arXiv preprint arXiv:2408.14819, 2024

    Abdelrahman Eldesokey and Peter Wonka. Build-a-scene: Interactive 3d layout control for diffusion-based image generation.arXiv preprint arXiv:2408.14819, 2024

  18. [18]

    Neural usd: An object-centric framework for iterative editing and control.arXiv preprint arXiv:2510.23956, 2025

    Alejandro Escontrela, Shrinu Kushagra, Sjoerd van Steenkiste, Yulia Rubanova, Aleksander Holynski, Kelsey Allen, Kevin Murphy, and Thomas Kipf. Neural usd: An object-centric framework for iterative editing and control.arXiv preprint arXiv:2510.23956, 2025

  19. [19]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  20. [20]

    Dreamsim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

  21. [21]

    An image is worth one word: Personalizing text-to-image generation using textual inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. InThe Eleventh International Conference on Learning Representations

  22. [22]

    Concept sliders: Lora adaptors for precise control in diffusion models

    Rohit Gandikota, Joanna Materzy´nska, Tingrui Zhou, Antonio Torralba, and David Bau. Concept sliders: Lora adaptors for precise control in diffusion models. InEuropean Conference on Computer Vision, pages 172–188. Springer, 2024

  23. [23]

    Tokenverse: Versatile multi-concept personalization in token modulation space.ACM Transactions On Graphics (TOG), 44(4):1–11, 2025

    Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, and Tali Dekel. Tokenverse: Versatile multi-concept personalization in token modulation space.ACM Transactions On Graphics (TOG), 44(4):1–11, 2025

  24. [24]

    Prompt-to-prompt image editing with cross attention control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. 2022

  25. [25]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022. 11

  26. [26]

    Wilddet3d: Scaling promptable 3d detection in the wild.arXiv preprint arXiv:2604.08626, 2026

    Weikai Huang, Jieyu Zhang, Sijun Li, Taoyang Jia, Jiafei Duan, Yunqian Cheng, Jaemin Cho, Mattew Wallingford, Rustin Soraki, Chris Dongjoo Kim, et al. Wilddet3d: Scaling promptable 3d detection in the wild.arXiv preprint arXiv:2604.08626, 2026

  27. [27]

    Clipdrag: Combining text-based and drag-based instructions for image editing

    Ziqi Jiang, Zhen Wang, and Long Chen. Clipdrag: Combining text-based and drag-based instructions for image editing. InInternational Conference on Learning Representations, volume 2025, pages 5237–5253, 2025

  28. [28]

    Saedit: Token-level control for continuous image editing via sparse autoencoder.arXiv preprint arXiv:2510.05081, 2025

    Ronen Kamenetsky, Sara Dorfman, Daniel Garibi, Roni Paiss, Or Patashnik, and Daniel Cohen- Or. Saedit: Token-level control for continuous image editing via sparse autoencoder.arXiv preprint arXiv:2510.05081, 2025

  29. [29]

    Imagic: Text-based real image editing with diffusion models

    Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6007–6017, 2023

  30. [30]

    Customizing text-to-image diffusion with object viewpoint control

    Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, and Jun-Yan Zhu. Customizing text-to-image diffusion with object viewpoint control. InSIGGRAPH Asia 2024 Conference Papers, pages 1–13, 2024

  31. [31]

    Generating multi-image synthetic data for text-to-image customization

    Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, and Samaneh Azadi. Generating multi-image synthetic data for text-to-image customization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16524–16534, 2025

  32. [32]

    Multi- concept customization of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi- concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023

  33. [33]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

  34. [34]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

  35. [35]

    Lightlab: Controlling light sources in images with diffusion models

    Nadav Magar, Amir Hertz, Eric Tabellion, Yael Pritch, Alex Rav-Acha, Ariel Shamir, and Yedid Hoshen. Lightlab: Controlling light sources in images with diffusion models. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

  36. [37]

    Object 3dit: Language-guided 3d-aware image editing.Advances in Neural Information Processing Systems, 36:3497–3516, 2023

    Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, and Tanmay Gupta. Object 3dit: Language-guided 3d-aware image editing.Advances in Neural Information Processing Systems, 36:3497–3516, 2023

  37. [38]

    Origen: Zero-shot 3d orientation grounding in text-to-image generation

    Yunhong Min, Daehyeon Choi, Kyeongmin Yeo, Jihyun Lee, and Minhyuk Sung. Origen: Zero-shot 3d orientation grounding in text-to-image generation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  38. [39]

    Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023

    Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023

  39. [40]

    Null-text inversion for editing real images using guided diffusion models

    Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023

  40. [41]

    Dragondiffusion: Enabling drag-style manipulation on diffusion models

    Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. Dragondiffusion: Enabling drag-style manipulation on diffusion models. InInternational Conference on Learning Representations, volume 2024, pages 31620–31631, 2024. 12

  41. [42]

    Diffusion handles enabling 3d edits for diffusion models by lifting activations to 3d

    Karran Pandey, Paul Guerrero, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, and Niloy J Mitra. Diffusion handles enabling 3d edits for diffusion models by lifting activations to 3d. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7695–7704, 2024

  42. [43]

    Compass control: Multi object orientation control for text-to-image generation

    Rishubh Parihar, Vaibhav Agrawal, Sachidanand VS, and Venkatesh Babu Radhakrishnan. Compass control: Multi object orientation control for text-to-image generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2791–2801, 2025

  43. [44]

    Kontinuous kontext: Continuous strength control for instruction-based image editing.arXiv preprint arXiv:2510.08532, 2025

    Rishubh Parihar, Or Patashnik, Daniil Ostashev, R Venkatesh Babu, Daniel Cohen-Or, and Kuan-Chieh Wang. Kontinuous kontext: Continuous strength control for instruction-based image editing.arXiv preprint arXiv:2510.08532, 2025

  44. [45]

    Zero-shot depth aware image editing with diffusion models

    Rishubh Parihar, Sachidanand VS, and R Venkatesh Babu. Zero-shot depth aware image editing with diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15748–15759, 2025

  45. [46]

    Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

  46. [47]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024

  47. [48]

    Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024

  48. [49]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

  49. [50]

    Geodiffuser: Geometry-based image editing with diffusion models

    Rahul Sajnani, Jeroen Vanbaar, Jie Min, Kapil D Katyal, and Srinath Sridhar. Geodiffuser: Geometry-based image editing with diffusion models. InProceedings of the Winter Conference on Applications of Computer Vision, pages 472–482, 2025

  50. [51]

    Alchemist: Parametric control of material properties with diffusion models

    Prafull Sharma, Varun Jampani, Yuanzhen Li, Xuhui Jia, Dmitry Lagun, Fredo Durand, Bill Freeman, and Mark Matthews. Alchemist: Parametric control of material properties with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24130–24141, 2024

  51. [53]

    Dragdiffusion: Harnessing diffusion models for interactive point-based image editing

    Yujun Shi, Chuhui Xue, Jun Hao Liew, Jiachun Pan, Hanshu Yan, Wenqing Zhang, Vincent YF Tan, and Song Bai. Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8839–8849, 2024

  52. [54]

    Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  53. [55]

    Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 13

  54. [56]

    Ominicontrol: Minimal and universal control for diffusion transformer

    Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14940–14950, 2025

  55. [57]

    Ominicontrol2: Efficient conditioning for diffusion transformers.arXiv preprint arXiv:2503.08280, 2025

    Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, and Xinchao Wang. Ominicontrol2: Efficient conditioning for diffusion transformers.arXiv preprint arXiv:2503.08280, 2025

  56. [58]

    Emer- gent correspondence from image diffusion.Advances in neural information processing systems, 36:1363–1389, 2023

    Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emer- gent correspondence from image diffusion.Advances in neural information processing systems, 36:1363–1389, 2023

  57. [59]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020

  58. [60]

    Gener- ative blocks world: Moving things around in pictures

    Vaibhav Vavilala, Seemandhar Jain, Rahul Vasanth, David Forsyth, and Anand Bhattad. Gener- ative blocks world: Moving things around in pictures. InICLR, 2026

  59. [61]

    Diffusers: State-of-the-art diffusion models

    Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/ diffusers, 2022

  60. [62]

    Bovik, H.R

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004

  61. [63]

    Qwen-image technical report, 2025

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  62. [64]

    Neural assets: 3d-aware multi-object scene synthesis with image diffusion models.Advances in Neural Information Processing Systems, 37:76289–76318, 2024

    Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew A Hudson, Igor Gilitschenski, Yusuf Aytar, Sjoerd Van Steenkiste, Kelsey R Allen, and Thomas Kipf. Neural assets: 3d-aware multi-object scene synthesis with image diffusion models.Advances in Neural Information Processing Systems, 37:76289–76318, 2024

  63. [65]

    Native and compact structured latents for 3d generation.arXiv preprint arXiv:2512.14692, 2025

    Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, et al. Native and compact structured latents for 3d generation.arXiv preprint arXiv:2512.14692, 2025

  64. [66]

    Spatialedit: Benchmarking fine-grained image spatial editing.arXiv preprint arXiv:2604.04911, 2026

    Yicheng Xiao, Wenhu Zhang, Lin Song, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, et al. Spatialedit: Benchmarking fine-grained image spatial editing.arXiv preprint arXiv:2604.04911, 2026

  65. [67]

    In- stantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruc- tion models.arXiv preprint arXiv:2404.07191, 2024

    Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. In- stantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruc- tion models.arXiv preprint arXiv:2404.07191, 2024

  66. [68]

    Paint by example: Exemplar-based image editing with diffusion models

    Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18381–18391, 2023

  67. [69]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10371–10381, 2024

  68. [70]

    Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023. 14

  69. [71]

    3d- fixer: Coarse-to-fine in-place completion for 3d scenes from a single image.arXiv preprint arXiv:2604.04406, 2026

    Ze-Xin Yin, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, and Jin Xie. 3d- fixer: Coarse-to-fine in-place completion for 3d scenes from a single image.arXiv preprint arXiv:2604.04406, 2026

  70. [72]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InIEEE International Conference on Computer Vision (ICCV), 2023

  71. [73]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

  72. [74]

    The unrea- sonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  73. [75]

    Geometric image editing via effects- sensitive in-context inpainting with diffusion transformers.arXiv preprint arXiv:2602.08388, 2026

    Shuo Zhang, Wenzhuo Wu, Huayu Zhang, Jiarong Cheng, Xianghao Zang, Chao Ban, Hao Sun, Zhongjiang He, Tianwei Cao, Kongming Liang, et al. Geometric image editing via effects- sensitive in-context inpainting with diffusion transformers.arXiv preprint arXiv:2602.08388, 2026

  74. [76]

    Easycontrol: Adding efficient and flexible control for diffusion transformer

    Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19513–19524, 2025

  75. [77]

    Gooddrag: Towards good practices for drag editing with diffusion models

    Zewei Zhang, Huan Liu, Jun Chen, and Xiangyu Xu. Gooddrag: Towards good practices for drag editing with diffusion models. InInternational Conference on Learning Representations, volume 2025, pages 13785–13808, 2025

  76. [78]

    Stable virtual camera: Generative view synthesis with diffusion models

    Jensen Zhou, Hang Gao, Vikram V oleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12405–12414, 2025

  77. [79]

    Training-free geometric image editing on diffusion models

    Hanshen Zhu, Zhen Zhu, Kaile Zhang, Yiming Gong, Yuliang Liu, and Xiang Bai. Training-free geometric image editing on diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19130–19140, 2025. 15 Supplemental Materials Table of Contents A.Additional Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . ....