arxiv: 2606.26738 · v1 · pith:MWPUTJ6Jnew · submitted 2026-06-25 · 💻 cs.CV

Do Image Editing Models Understand Lighting?

Pith reviewed 2026-06-26 05:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image editing, lighting, light transport, benchmark, generative models, HDR dataset, physics consistency, light probe

The pith

Image editing models largely reproduce real lighting physics but err more in low-light regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark of 1K real HDR image pairs of indoor scenes, each captured with a physical light probe switched off and on. It tests whether state-of-the-art editing models produce changes that match the measured light transport, using two new scores that compensate for unrelated photographic adjustments such as white balance. The best models turn out to be remarkably consistent with real-world physics, although overall performance differs considerably between models (slightly less so for specular highlights), and every model degrades in regions where the probe contributes less light. The work also finds that vision-language models are unsuitable for pixel-level light transport analysis. This supplies a concrete, physics-anchored way to measure an implicit capability that existing benchmarks, which rely on VLM or human judgement of plausibility or on synthetic data, do not measure.

Core claim

Using the 3D-anchored Light Probe benchmark built from 1K real-world HDR pairs with annotated regions, the evaluation demonstrates that the strongest image editing models produce edits that align closely with measured real-world light transport, although performance varies across models and remains weaker in regions receiving less light from the probe; vision-language models prove unsuitable for this pixel-level task.

What carries the argument

The 3D-anchored Light Probe (3DLP) benchmark, consisting of physically captured on/off light-probe image pairs and region annotations, together with two new scores that isolate lighting transport accuracy from other generative effects.

If this is right

  • Best-performing models exhibit high consistency with real-world light transport, while still leaving room for improvement.
  • All tested models produce more errors in image regions that receive less light from the probe.
  • Performance differences between models are slightly smaller for specular highlights than overall.
  • Vision-language models cannot substitute for pixel-level light-transport evaluation on this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended to test editing models on outdoor scenes or multiple simultaneous light sources.
  • Training pipelines might incorporate explicit supervision from similar on/off probe pairs to reduce low-light errors.
  • The observed low-light bias suggests current models rely on learned appearance priors rather than explicit transport simulation.

Load-bearing premise

The two new scores successfully separate lighting transport accuracy from confounding effects such as white-balance shifts, and the captured pairs contain no uncontrolled variables besides the light-probe state.

What would settle it

Independent re-measurement of the released image pairs with calibrated light meters that finds no systematic increase in error for low-light regions after applying the two scores.

Figures

Figures reproduced from arXiv: 2606.26738 by Carsten Rother, Johann-Friedrich Feiden, Matthias Nießner, Tim Küchler.

Figure 1: Our 3DLP Task. (Left) Three examples from our 3DLP dataset with the light probe turned off and on, respectively. The dataset includes a wide range of materials, geometries and ambient lighting effects. (Right) A result from a strong AI model (Nano Banana 2) for the turn-on task, compared to the real image. The AI has turned on the light with a different brightness, and the photo has a different exposure. O…
Figure 2: 3DLP Pipeline illustrating the turn-on task. (Left) Image with a light bulb turned off, $I^{\mathrm{off}}_{R}$, where the stand as well as the bulb are visible (white boxes). The AI is tasked to turn on the bulb with a bright, white light, producing $I^{\mathrm{on}}_{\mathrm{AI}}$. To isolate the light transport contribution of the single light bulb, we compute the ratio image of the AI, $E^{\mathrm{on}}_{\mathrm{AI}} = I^{\mathrm{on}}_{\mathrm{AI}} / I^{\mathrm{off}}_{R}$, and the real ratio image, Eo…
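
The ratio-image step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: images are assumed to be single-channel linear HDR intensities stored as nested lists, the helper name `ratio_image` and the `eps` threshold are hypothetical, and the real ratio image is assumed by analogy to be the real on-image divided by the real off-image (the caption is truncated at that point).

```python
from typing import List, Optional

Image = List[List[float]]  # single-channel linear HDR intensities (assumed layout)


def ratio_image(i_edit: Image, i_off: Image, eps: float = 1e-6) -> List[List[Optional[float]]]:
    """Per-pixel ratio E = I_edit / I_off; None where I_off is too dark to divide by.

    `eps` is a hypothetical low-signal threshold, not a value from the paper.
    """
    if len(i_edit) != len(i_off) or any(len(a) != len(b) for a, b in zip(i_edit, i_off)):
        raise ValueError("images must have the same shape")
    return [
        [e / o if o > eps else None for e, o in zip(row_e, row_o)]
        for row_e, row_o in zip(i_edit, i_off)
    ]


# Toy 2x2 data (made up): one real off-image, a real on-image and an AI on-image.
i_off_real = [[0.10, 0.20], [0.40, 0.0]]
i_on_real = [[0.30, 0.40], [0.40, 0.05]]
i_on_ai = [[0.20, 0.40], [0.80, 0.05]]

e_real = ratio_image(i_on_real, i_off_real)  # assumed real ratio image
e_ai = ratio_image(i_on_ai, i_off_real)      # AI ratio image, as in the caption
```

Dividing by the same real off-image in both cases means that surface colour and ambient light cancel per pixel, so the two ratio images differ only in how much light the edit added. The pixel with a zero off-value is masked rather than divided.
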
Figure 3: Qualitative results (best viewed zoomed-in) for the turn-on (top) and turn-off (bottom) task. The SIE error map is visualised as $S(E^{t}_{R}) - S(E^{t}_{\mathrm{AI}})$, where red indicates an excess of light relative to the rest of the image, blue indicates a deficit, and black pixels denote regions excluded due to clipping, low signal, or window labels. Note that the AI is free to adapt the global exposure and white balance…
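
To make the error map in the Figure 3 caption concrete, here is a small sketch. The excerpt does not define $S(\cdot)$, so the code assumes, purely for illustration, that it is a z-score of the log-ratio over valid pixels; the function names `standardise` and `error_map` are hypothetical. Under that assumption, a global brightness gain applied to the AI result leaves the map unchanged, which is the kind of freedom the caption grants the model.

```python
import math
from typing import List, Optional

Ratios = List[Optional[float]]  # flattened ratio image; None marks excluded pixels


def standardise(ratios: Ratios) -> Ratios:
    """Assumed S(.): z-score of log-ratios over valid pixels (not confirmed by the source)."""
    logs = [math.log(r) for r in ratios if r is not None and r > 0]
    if len(logs) < 2:
        raise ValueError("need at least two valid pixels")
    mean = sum(logs) / len(logs)
    std = math.sqrt(sum((v - mean) ** 2 for v in logs) / len(logs))
    if std == 0:
        raise ValueError("a constant ratio image cannot be standardised")
    return [
        (math.log(r) - mean) / std if r is not None and r > 0 else None
        for r in ratios
    ]


def error_map(e_real: Ratios, e_ai: Ratios) -> Ratios:
    """Per-pixel S(E_R) - S(E_AI); None wherever either input is excluded."""
    s_real, s_ai = standardise(e_real), standardise(e_ai)
    return [
        a - b if a is not None and b is not None else None
        for a, b in zip(s_real, s_ai)
    ]


# Made-up ratio values; the last pixel is excluded in both images.
e_real = [4.0, 2.0, 1.0, None]
e_ai = [2.0, 2.0, 1.0, None]
e_ai_brighter = [2.0 * r if r is not None else None for r in e_ai]  # global gain of 2

err = error_map(e_real, e_ai)
err_gain = error_map(e_real, e_ai_brighter)  # same map up to rounding
```

Working in the log domain turns a multiplicative gain into an additive offset, which the mean subtraction removes. Whether the paper's scores use this exact normalisation is not stated in the excerpt.
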
Figure 4: Light intensity band analysis for the turn-on task. Performance of various AI models evaluated across light intensity bands cast by the light probe. These bands serve as a proxy for the 3D distance to the light source, starting with the highest intensity band (closest to the source) on the left. Both metrics are calculated for the turn-on task. For both metrics, the lowest error is always within the highes…
Figure 5: Material and light effect analysis. For each annotated class the relative SIE scores are shown. This means that the best model always has a value of 0, in most cases Nano Banana Pro, and we measure the relative increase (in percentage) of error for the remaining models. The spread of errors for some classes is higher than for other classes.
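
The relative scoring described in the Figure 5 caption (best model at 0, others as a percentage increase in error) can be illustrated with a short sketch. The model names and error values below are made up, and `relative_increase` is a hypothetical helper; the only thing taken from the caption is the normalisation itself, assuming lower error is better.

```python
from typing import Dict


def relative_increase(errors: Dict[str, float]) -> Dict[str, float]:
    """Percentage increase of each model's error over the best (lowest) error."""
    best = min(errors.values())
    if best <= 0:
        raise ValueError("the best error must be positive to form a ratio")
    return {model: 100.0 * (err - best) / best for model, err in errors.items()}


# Made-up per-class errors for three placeholder models.
class_errors = {"model_a": 0.20, "model_b": 0.25, "model_c": 0.30}
relative = relative_increase(class_errors)
# model_a is the reference at 0; model_b and model_c are about 25 and 50 percent worse.
```

Because each class is normalised by its own best model, the resulting numbers compare the spread between models within a class, not absolute difficulty across classes.
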
Figure 6: Failure case for ambient highlights and shadows for the turn-off task with the Qwen-Image-Edit model. Left are the real turned-on and turned-off images, as well as the respective difference images. The real difference image ($I^{\mathrm{on}}_{R} - I^{\mathrm{off}}_{R}$) only shows lighting effects caused by the light probe. In contrast, the difference image of the AI ($I^{\mathrm{on}}_{R} - I^{\mathrm{off}}_{\mathrm{AI}}$) has additional incorrect effects. We see in t…
Figure 7: Challenging transparent object. Real images (left) alongside the results of four different models. None of the models is able to perform the task correctly for this highly challenging scene. Nonetheless the result of Nano Banana Pro is visually quite convincing.
4.4 Light probe ablation. To motivate our choice of a spherical lamp, we capture an additional 150 images for different lamp geometries (see detail…
Figure 8: Qualitative comparison of SIE and LFE versus PSNR, SSIM and LPIPS. Major rows present the real on-image $I^{\mathrm{on}}_{R}$ and the AI-generated on-images $I^{\mathrm{on}}_{\mathrm{AI}}$. The rows below show the intensity ratios $E^{\mathrm{on}}_{R}$ and $E^{\mathrm{on}}_{\mathrm{AI}}$. For each row the best value per metric is given in bold and the second best underlined. Orange and blue boxes indicate samples in which SIE/LFE and PSNR/SSIM/LPIPS are in exceptionally high disagreeme…
Figure 9: Images of tested lamps. Visual reference for the seven different lamp types used in Section 4.4 and Section B.2: Spherical, Area, Basket, Daylight, Directional, Light Bar, and Shaded. For the main 3DLP benchmark (…
Figure 10: SIE metric computed per annotated class. Presented on an absolute scale (in comparison to …
Figure 11: LFE metric computed per annotated class. Presented on an absolute scale (in comparison to …
Figure 12: Intensity band visualisations. The difference images $I^{\mathrm{on}}_{R} - I^{\mathrm{off}}_{R}$ represent the intensity of the light probe. Each band captures a different range of intensities (colour coded for better visibility). The 16–100% band includes the brightest pixels and the 0–1% band the ones with lowest intensity.
C.5 Additional qualitative results. Figs. 13 and 14 show additional results for the turn-on and turn-off tas…
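
The intensity bands from the Figure 12 caption can be sketched as a simple binning of the real difference image. Only the 0–1% and 16–100% bands are named in the caption; the intermediate edges at 1, 4 and 16 percent, the choice to measure percentages relative to the brightest difference pixel, and the helper name `band_labels` are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List

# Band edges in percent. Only "0-1%" and "16-100%" appear in the caption;
# the 4% edge and the resulting middle bands are hypothetical.
BAND_EDGES = [1.0, 4.0, 16.0]
BAND_NAMES = ["0-1%", "1-4%", "4-16%", "16-100%"]


def band_labels(i_on: List[float], i_off: List[float]) -> List[str]:
    """Assign each pixel to a band from its share of the peak probe contribution."""
    diff = [max(on - off, 0.0) for on, off in zip(i_on, i_off)]
    peak = max(diff)
    if peak <= 0:
        raise ValueError("the light probe contributes no light in this image pair")
    return [BAND_NAMES[bisect_right(BAND_EDGES, 100.0 * d / peak)] for d in diff]


# Made-up flattened pixels: the probe adds 1.0, 0.1, 0.03 and 0.005 units of light.
i_off = [0.1, 0.1, 0.1, 0.1]
i_on = [1.1, 0.2, 0.13, 0.105]
labels = band_labels(i_on, i_off)
```

Per-band errors can then be computed by averaging an error map over the pixels that share a label, which is how a breakdown like the one in Figure 4 could be assembled.
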
Figure 13: Qualitative results. This shows a challenging example with a plant next to the light bulb. Notice how GPT Image 1.5 fails to illuminate the plant at all. All other models also struggle with this complex light interaction for the turn-on task. The standardised intensity ratio is defined as $E^{t}_{R}$ for the real image and $E^{t}_{\mathrm{AI}}$ for the AI images. The error map of SIE is visualised as $S(E^{t}_{R}) - S(E^{t}_{\mathrm{AI}})$ and the …
Figure 14: Qualitative results. Notice the highlight on the floor. While all models are able to successfully remove the highlight for the turn-off task, all methods fail to reproduce the exact placement and shape of the highlight for the turn-on task. The standardised intensity ratio is defined as $E^{t}_{R}$ for the real image and $E^{t}_{\mathrm{AI}}$ for the AI images. The error map of SIE is visualised as $S(E^{t}_{R}) - S(E^{t}_{\mathrm{AI}})$ and the …
Figure 15: 3DLP Dataset diversity. We visualise 50 randomly sampled on-images from our dataset.
Figure 16: Annotation examples. Each label is visualised with a different colour. Please note that we prioritise annotation correctness over exhaustive completeness.
F Prompt selection. Figs. 18 to 25 show a selection of prompts that we tested for each model. The prompt that was finally used for each respective model is printed bold. Note, the selection was done by visual and metric inspection. In total, we tested al…
Figure 17: Capture process. For each pair, we capture two 9-exposure HDR brackets using a Sony α6600 with 14-bit precision, once for the on-image and once for the off-image. To ensure both images have an identical ambient light noise level, we fix the aperture, focus, and exposure time for all images in a given view. The exposure is chosen to produce a well-exposed on-image. After capturing, we average the individua…
Figure 18: Prompt selection for Nano Banana Pro: It is evident that Nano Banana Pro reliably completes and understands the task. Across all prompts, the model successfully detects the correct lamp and turns on its light while keeping the ambient light unchanged. Based on the metrics calculated across all tested prompts, we decided to use the prompt in bold.
Figure 19: Prompt selection for Nano Banana 2: Like Nano Banana Pro, Nano Banana 2 has no problem in executing the task correctly. Similar to the Pro version, we selected the prompt that produced the best metrics for our dataset (shown in bold).
Figure 20: Prompt selection for Qwen-Image-Edit: As shown, Qwen-Image-Edit exhibits variations in performance depending on the prompt. It fails to understand the task for the prompt used in row 4. In three out of the four images, Qwen-Image-Edit turns on the wrong lamp or does not illuminate the scene at all. Based on visual inspection and metrics, we decided to use the prompt marked in bold.
Figure 21: Prompt selection for Flux 2 Max: As seen in rows 5-7, Flux 2 Max tends to produce highly variable results for certain prompts. We obtained the most consistent results using a simple prompt: “Turn on the light bulb on the black pole.”
Figure 22: Prompt selection for Flux 2 Dev: As with Flux 2 Max, the choice of prompt affects the performance of Flux 2 Dev considerably, more than other models like Qwen-Image-Edit or Nano Banana Pro. Based on metrics and visual inspection, we chose the prompt printed in bold. Even though Flux 2 Dev misinterpreted the lamp position in one of the examples, the metrics for this prompt remained best.
Figure 23: Prompt selection for GPT Image 1.5: For GPT Image 1.5, the 5th row produced the most consistent results.
Figure 24: Prompt selection for Bagel 7B MoT: As seen in the images, Bagel struggles to illuminate the scene, even though it turns on the right light most of the time. Based on performance metrics we chose the prompt in row 4 (bold). Note that due to these problems the model was not selected for the main evaluation.
Figure 25: Prompt selection for OmniGen2: This model struggles the most with turning the correct light on. Oftentimes, it just applies a bright spot to the image. However, using the prompt in row 4, OmniGen2 was still able to illuminate the scene somehow. Note that due to these problems the model was not selected for the main evaluation.
Original abstract

While recent advancements in generative image editing models have achieved stunning visual fidelity, it remains an open question whether these systems possess an intrinsic knowledge of real-world lighting. Existing benchmarks typically evaluate high-level plausibility of perceptual light transport on curated internet imagery, using VLMs or human judgement, or they rely on synthetically generated datasets. In this work, we introduce the 3D-anchored Light Probe (3DLP) benchmark, for which we have captured a new high-fidelity HDR dataset of real-world lighting changes. The dataset consists of 1K image pairs of diverse indoor scenery in which light probes are physically turned on and off. To allow for a granular performance analysis, we annotated specific image regions such as cast shadows or metallic surfaces. With this data, we evaluate a range of state-of-the-art image editing models by measuring how well their light probe edits align with reality. The evaluation uses two new scores to compensate for AI-generated photographic effects, such as adjusted white balance. Our results show that the overall performance of models differs considerably, with differences slightly less pronounced for specular highlights. The best image editing models are remarkably consistent with real-world physics, however, they still leave room for improvement. We observe that image regions that receive less light from the light probe are more prone to errors for all models. Furthermore, building on their success in evaluating macroscopic lighting plausibility, we test VLMs on our task but find that they are unsuitable for pixel-level light transport analysis. We will make the benchmark, together with the real-world dataset, publicly available to encourage future research on this topic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the 3DLP benchmark based on a newly captured high-fidelity HDR dataset of 1K real-world indoor image pairs in which light probes are physically toggled on and off. It evaluates a range of state-of-the-art image editing models by measuring alignment of their light-probe edits with the captured ground truth, using two newly proposed scores intended to compensate for AI-generated photographic effects such as white-balance shifts. The central claims are that the best models are remarkably consistent with real-world physics, that performance differs considerably between models (somewhat less so for specular highlights), that errors are more frequent in regions receiving less light from the probe, and that VLMs are unsuitable for pixel-level light-transport analysis. The benchmark and dataset are to be released publicly.

Significance. If the evaluation protocol and new scores are shown to isolate lighting transport, the work supplies a rare real-world, physics-grounded benchmark that moves beyond synthetic data or VLM/human perceptual judgments. The public release of the 1K annotated pairs and the observation of systematic low-light errors would be useful contributions to the field.

major comments (2)
  1. [Evaluation section] Evaluation section (around the definition of the two new scores): the manuscript provides no ablation, correlation analysis, or controlled experiment demonstrating that the scores successfully remove confounds such as model-induced white-balance or exposure shifts; without this, the reported consistency numbers and the low-light error pattern cannot be attributed solely to lighting transport accuracy.
  2. [Section 3] Dataset collection protocol (Section 3): the claim that the 1K pairs differ solely in light-probe state is load-bearing for all quantitative results, yet the text does not report explicit controls or measurements for camera stability, ambient light drift, or other capture variables between the on/off pairs.
minor comments (2)
  1. The exact mathematical definitions of the two new scores should be moved from any appendix into the main text with a short worked example on one image pair.
  2. Figure captions for the annotated regions (cast shadows, metallic surfaces) should explicitly state the annotation protocol and inter-annotator agreement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each of the major comments below.

Point-by-point responses
  1. Referee: [Evaluation section] Evaluation section (around the definition of the two new scores): the manuscript provides no ablation, correlation analysis, or controlled experiment demonstrating that the scores successfully remove confounds such as model-induced white-balance or exposure shifts; without this, the reported consistency numbers and the low-light error pattern cannot be attributed solely to lighting transport accuracy.

    Authors: We agree that explicit validation of the two new scores is needed to confirm they isolate lighting transport. In the revision we will add an ablation study that measures score correlation with ground-truth probe intensity changes, plus controlled experiments that apply synthetic white-balance and exposure shifts to the ground-truth pairs and verify that the scores remain stable. These additions will directly address the concern about confounds. revision: yes

  2. Referee: [Section 3] Dataset collection protocol (Section 3): the claim that the 1K pairs differ solely in light-probe state is load-bearing for all quantitative results, yet the text does not report explicit controls or measurements for camera stability, ambient light drift, or other capture variables between the on/off pairs.

    Authors: Section 3 already states that a fixed tripod and a single controlled indoor session were used for each scene. We acknowledge that quantitative drift measurements were omitted. The revised manuscript will add a dedicated paragraph reporting (i) repeated camera-position checks with a laser level, (ii) ambient-light readings taken before and after each pair, and (iii) the observed maximum drift values, all of which remained below the noise floor of the HDR capture. revision: yes
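
The controlled experiment promised in the first response (apply synthetic white-balance and exposure shifts to a ground-truth pair and check that the score stays stable) could look roughly like the sketch below. The score used here, `toy_transport_error`, is a stand-in invented for this example and is not the paper's SIE or LFE; it compares per-channel log-ratios after removing each channel's mean. All pixel values and gains are made up.

```python
import math
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # linear RGB, assumed strictly positive


def apply_gains(img: List[Pixel], gains: Pixel) -> List[Pixel]:
    """Simulate a white-balance/exposure shift with one gain per colour channel."""
    return [(px[0] * gains[0], px[1] * gains[1], px[2] * gains[2]) for px in img]


def centred_log_ratio(i_on: List[Pixel], i_off: List[Pixel]) -> List[List[float]]:
    """Per-channel log(I_on / I_off) with that channel's mean removed."""
    channels = []
    for ch in range(3):
        logs = [math.log(on[ch] / off[ch]) for on, off in zip(i_on, i_off)]
        mean = sum(logs) / len(logs)
        channels.append([v - mean for v in logs])
    return channels


def toy_transport_error(i_test: List[Pixel], i_ref: List[Pixel], i_off: List[Pixel]) -> float:
    """Stand-in score (hypothetical): mean squared gap between centred log-ratios."""
    a = centred_log_ratio(i_test, i_off)
    b = centred_log_ratio(i_ref, i_off)
    return sum((x - y) ** 2 for ca, cb in zip(a, b) for x, y in zip(ca, cb)) / (3 * len(i_off))


def mse(i_test: List[Pixel], i_ref: List[Pixel]) -> float:
    """Plain pixel-wise mean squared error, for contrast."""
    return sum((x - y) ** 2 for p, q in zip(i_test, i_ref) for x, y in zip(p, q)) / (3 * len(i_ref))


# Made-up ground-truth pair with three pixels.
i_off = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.4, 0.3, 0.2)]
i_on = [(0.5, 0.4, 0.3), (0.3, 0.3, 0.3), (0.45, 0.35, 0.25)]

# Perturb the true on-image with a synthetic colour cast and exposure change.
shifted = apply_gains(i_on, (1.5, 1.2, 0.9))

stable_score = toy_transport_error(shifted, i_on, i_off)  # stays near zero
naive_score = mse(shifted, i_on)                          # clearly non-zero
```

A score that passes this check for the paper's own metrics would support the claim that they factor out global photographic adjustments. It would not by itself show robustness to spatially varying effects such as local tone mapping, which would need a separate perturbation.
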

Circularity Check

0 steps flagged

Empirical benchmark with physical ground truth exhibits no circularity

full rationale

The paper captures a new real-world HDR dataset of 1K image pairs differing only by physical light-probe state, annotates regions, and evaluates model edits via direct comparison to this ground truth using two new scores designed to compensate for AI photographic effects. No equations, fitted parameters, self-citations, or ansatzes reduce the reported consistency scores or error patterns to self-referential quantities by construction. The central claim of model consistency with real-world physics rests on independent physical measurements rather than any derivation that collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumption that turning physical light probes on and off produces isolated, measurable lighting changes and that the new scores correctly factor out non-lighting AI artifacts.

axioms (1)
  • domain assumption: The captured image pairs accurately isolate lighting changes from other scene variables.
    Invoked when treating the on/off pairs as ground-truth light transport.

Reference graph

Works this paper leans on

65 extracted references · 2 canonical work pages

  1. [1]

    Adobe ushers in a new era of creativity with new creative agent and generative AI innovations in Adobe Firefly, April 2026

    Adobe Inc. Adobe ushers in a new era of creativity with new creative agent and generative AI innovations in Adobe Firefly, April 2026. URL https://news.adobe.com/news/2026/04/ adobe-new-creative-agent

  2. [2]

    Blended latent diffusion.ACM transactions on graphics (TOG), 42(4):1–11, 2023

    Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion.ACM transactions on graphics (TOG), 42(4):1–11, 2023

  3. [3]

    Text2live: Text-driven layered image and video editing

    Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. InEuropean conference on computer vision, pages 707–723. Springer, 2022

  4. [4]

    Switchlight 3.0 is here, November 2025

    Beeble AI. Switchlight 3.0 is here, November 2025. URL https://beeble.ai/research/ switchlight-3-0-is-here

  5. [5]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  6. [6]

    Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

  7. [7]

    Diffusionlight-turbo: Accelerated light probes for free via single-pass chrome ball inpainting

    Worameth Chinchuthakun, Pakkapon Phongthawee, Amit Raj, Varun Jampani, Pramook Khungurn, and Supasorn Suwajanakorn. Diffusionlight-turbo: Accelerated light probes for free via single-pass chrome ball inpainting. InArXiv, 2025

  8. [8]

    Insert in style: A zero-shot generative framework for harmonious cross-domain object composition.arXiv preprint arXiv:2511.15197, 2025

    Raghu Vamsi Chittersu, Yuvraj Singh Rathore, Pranav Adlinge, and Kunal Swami. Insert in style: A zero-shot generative framework for harmonious cross-domain object composition.arXiv preprint arXiv:2511.15197, 2025

  9. [9]

    Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  10. [10]

    Commonsense-t2i challenge: Can text-to-image generation models understand commonsense?arXiv preprint arXiv:2406.07546, 2024

    Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, and Dan Roth. Commonsense-t2i challenge: Can text-to-image generation models understand commonsense?arXiv preprint arXiv:2406.07546, 2024

  11. [11]

    Introducing nano banana pro: Gemini 3 pro’s image generation and editing model

    Google. Introducing nano banana pro: Gemini 3 pro’s image generation and editing model. Google Cloud Blog, 11 2025. URL https://blog.google/innovation-and-ai/products/nano-banana-pro/ . Introducing Nano Banana Pro

  12. [12]

    Build with nano banana 2, our best image generation and editing model

    Google. Build with nano banana 2, our best image generation and editing model. Google Blog, 2 2026. URL https://blog.google/innovation-and-ai/technology/developers-tools/ build-with-nano-banana-2/

  13. [13]

    Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, and Sunzida Siddique

    Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, and Sunzida Siddique. Physics-based benchmarking metrics for multimodal synthetic images, 2026. URL https://arxiv.org/abs/2511.15204

  14. [14]

    Unirelight: Learning joint decomposition and synthesis for video relighting.arXiv preprint arXiv:2506.15673, 2025

    Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar, Alexander Keller, Sanja Fidler, Igor Gilitschenski, Zan Gojcic, and Zian Wang. Unirelight: Learning joint decomposition and synthesis for video relighting.arXiv preprint arXiv:2506.15673, 2025

  15. [15]

    Prompt-to- prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to- prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

  16. [16]

    Paralleledits: Efficient multi-object image editing.arXiv preprint arXiv:2406.00985, 2024

    Mingzhen Huang, Jialing Cai, Shan Jia, Vishnu Suresh Lokhande, and Siwei Lyu. Paralleledits: Efficient multi-object image editing.arXiv preprint arXiv:2406.00985, 2024

  17. [17]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  18. [18]

    Marigold: Affordable adaptation of diffusion-based image generators for image analysis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, and Konrad Schindler. Marigold: Affordable adaptation of diffusion-based image generators for image analysis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 11

  19. [19]

    Intrinsic image fusion for multi-view 3d material reconstruction.ArXiv, 2025

    Peter Kocsis, Lukas Höllein, and Matthias Nießner. Intrinsic image fusion for multi-view 3d material reconstruction.ArXiv, 2025

  20. [20]

    Learning action and reasoning-centric image editing from videos and simulation.Advances in Neural Information Processing Systems, 37:38035–38078, 2024

    Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, and Siva Reddy. Learning action and reasoning-centric image editing from videos and simulation.Advances in Neural Information Processing Systems, 37:38035–38078, 2024

  21. [21]

    Flux.2 [dev]: 32b parameter rectified flow transformer, 2025

    Black Forest Labs. Flux.2 [dev]: 32b parameter rectified flow transformer, 2025. URL https:// huggingface.co/black-forest-labs/FLUX.2-dev

  22. [22]

    FLUX.2: Frontier Visual Intelligence

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025. Technical report for the FLUX.2 family including Max

  23. [23]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

  24. [24]

    Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

    Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

  25. [25]

    Openrooms: An open framework for photorealistic indoor scene datasets

    Zhengqin Li, Ting-Wei Yu, Shen Sang, Sarah Wang, Meng Song, Yuhan Liu, Yu-Ying Yeh, Rui Zhu, Nitesh Gundavarapu, Jia Shi, et al. Openrooms: An open framework for photorealistic indoor scene datasets. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7190–7199, 2021

  26. [26]

    Luxremix: Lighting decomposition and remixing for indoor scenes.arXiv preprint arXiv:2601.15283, 2026

    Ruofan Liang, Norman Müller, Ethan Weber, Duncan Zauss, Nandita Vijaykumar, Peter Kontschieder, and Christian Richardt. Luxremix: Lighting decomposition and remixing for indoor scenes.arXiv preprint arXiv:2601.15283, 2026

  27. [27]

    Pi-light: Physics-inspired diffusion for full-image relighting.arXiv preprint arXiv:2601.22135, 2026

    Zhexin Liang, Zhaoxi Chen, Yongwei Chen, Tianyi Wei, Tengfei Wang, and Xingang Pan. Pi-light: Physics-inspired diffusion for full-image relighting.arXiv preprint arXiv:2601.22135, 2026

  28. [28]

    I2ebench: A comprehensive benchmark for instruction-based image editing.Advances in Neural Information Processing Systems, 37:41494–41516, 2024

    Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, and Rongrong Ji. I2ebench: A comprehensive benchmark for instruction-based image editing.Advances in Neural Information Processing Systems, 37:41494–41516, 2024

  29. [29]

    Lightlab: Controlling light sources in images with diffusion models

    Nadav Magar, Amir Hertz, Eric Tabellion, Yael Pritch, Alex Rav-Acha, Ariel Shamir, and Yedid Hoshen. Lightlab: Controlling light sources in images with diffusion models. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

  30. [30]

    Do-undo: Generating and reversing physical actions in vision-language models.arXiv preprint arXiv:2512.13609, 2025

    Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, and Fatih Porikli. Do-undo: Generating and reversing physical actions in vision-language models.arXiv preprint arXiv:2512.13609, 2025

  31. [31]

    Phybench: A physical commonsense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802, 2024

    Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, et al. Phybench: A physical commonsense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802, 2024

  32. [32]

    Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025

  33. [33]

    Addendum to gpt-4o system card: Native image generation

    OpenAI. Addendum to gpt-4o system card: Native image generation. Technical report, OpenAI, 2025. URLhttps://openai.com/index/gpt-4o-system-card-addendum. Introducing GPT Image 1

  34. [34]

    The new chatgpt images is here

    OpenAI. The new chatgpt images is here. OpenAI News, December 2025. URL https://openai.com/index/new-chatgpt-images-is-here/. Introducing GPT Image 1.5

  35. [35]

    Diffusionlight: Light probes for free by painting a chrome ball

    Pakkapon Phongthawee, Worameth Chinchuthakun, Nontaphat Sinsunthithet, Varun Jampani, Amit Raj, Pramook Khungurn, and Supasorn Suwajanakorn. Diffusionlight: Light probes for free by painting a chrome ball. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 98–108, 2024

  36. [36]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  37. [37]

    Picabench: How far are we from physically realistic image editing? arXiv preprint arXiv:2510.17681, 2025

    Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, and Yihao Liu. Picabench: How far are we from physically realistic image editing? arXiv preprint arXiv:2510.17681, 2025

  38. [38]

    Infinigen indoors: Photorealistic indoor scenes using procedural generation

    Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, Zeyu Ma, and Jia Deng. Infinigen indoors: Photorealistic indoor scenes using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21783–21...

  39. [39]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV), 2021

  40. [40]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  41. [41]

    Shadows don’t lie and lines can’t bend! generative models don’t know projective geometry

    Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, David A Forsyth, and Anand Bhattad. Shadows don’t lie and lines can’t bend! generative models don’t know projective geometry... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28140–28149, 2024

  42. [42]

    Synclight: Controllable and consistent multi-view relighting. arXiv preprint arXiv:2601.16981, 2026

    David Serrano-Lozano, Anand Bhattad, Luis Herranz, Jean-François Lalonde, and Javier Vazquez-Corral. Synclight: Controllable and consistent multi-view relighting. arXiv preprint arXiv:2601.16981, 2026

  43. [43]

    T2i-reasonbench: Benchmarking reasoning-informed text-to-image generation. arXiv preprint arXiv:2508.17472, 2025

    Kaiyue Sun, Rongyao Fang, Chengqi Duan, Xian Liu, and Xihui Liu. T2i-reasonbench: Benchmarking reasoning-informed text-to-image generation. arXiv preprint arXiv:2508.17472, 2025

  44. [44]

    Spatiotemporally consistent indoor lighting estimation with diffusion priors

    Mutian Tong, Rundi Wu, and Changxi Zheng. Spatiotemporally consistent indoor lighting estimation with diffusion priors. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

  45. [45]

    Relight my nerf: A dataset for novel view synthesis and relighting of real world objects

    Marco Toschi, Riccardo De Matteo, Riccardo Spezialetti, Daniele De Gregorio, Luigi Di Stefano, and Samuele Salti. Relight my nerf: A dataset for novel view synthesis and relighting of real world objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20762–20772, 2023

  46. [46]

    Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with lmms

    Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, and Xiongkuo Min. Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with lmms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17312–17323, 2025

  47. [47]

    Imagen editor and editbench: Advancing and evaluating text-guided image inpainting

    Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J Fleet, Radu Soricut, et al. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18359–18369, 2023

  48. [48]

    Genspace: Benchmarking spatially-aware image generation. arXiv preprint arXiv:2505.24870, 2025

    Zehan Wang, Jiayang Xu, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, and Zhou Zhao. Genspace: Benchmarking spatially-aware image generation. arXiv preprint arXiv:2505.24870, 2025

  49. [49]

    Everything in its place: Benchmarking spatial intelligence of text-to-image models. arXiv preprint arXiv:2601.20354, 2026

    Zengbin Wang, Xuecai Hu, Yong Wang, Feng Xiong, Man Zhang, and Xiangxiang Chu. Everything in its place: Benchmarking spatial intelligence of text-to-image models. arXiv preprint arXiv:2601.20354, 2026

  50. [50]

    Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004

  51. [51]

    Qwen-image technical report, 2025

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  52. [52]

    Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025

  53. [53]

    Kris-bench: Benchmarking next-level intelligent image editing models

    Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, and Xu Yang. Kris-bench: Benchmarking next-level intelligent image editing models. arXiv preprint arXiv:2505.16707, 2025

  54. [54]

    Ictpolarreal: A polarized reflection and material dataset of real world objects. arXiv preprint arXiv:2603.24912, 2026

    Jing Yang, Krithika Dharanikota, Emily Jia, Haiwei Chen, and Yajie Zhao. Ictpolarreal: A polarized reflection and material dataset of real world objects. arXiv preprint arXiv:2603.24912, 2026

  55. [55]

    Primedepth: Efficient monocular depth estimation with a stable diffusion preimage

    Denis Zavadski, Damjan Kalšan, and Carsten Rother. Primedepth: Efficient monocular depth estimation with a stable diffusion preimage. In Proceedings of the Asian Conference on Computer Vision, pages 922–940, 2024

  56. [56]

    RGB↔X: Image decomposition and synthesis using material- and lighting-aware diffusion models

    Zheng Zeng, Valentin Deschaintre, Iliyan Georgiev, Yannick Hold-Geoffroy, Yiwei Hu, Fujun Luan, Ling-Qi Yan, and Miloš Hašan. RGB↔X: Image decomposition and synthesis using material- and lighting-aware diffusion models. In ACM SIGGRAPH 2024 Conference Papers, SIGGRAPH ’24, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400705250. doi: 10.1145/3641519.3657445. URL https://doi.org/10.1145/3641519.3657445

  58. [58]

    Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

  59. [59]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  60. [60]

    Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing. arXiv preprint arXiv:2504.02826, 2025

    Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, et al. Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing. arXiv preprint arXiv:2504.02826, 2025

  61. [61]

    Olatverse: A large-scale real-world object dataset with precise lighting control. arXiv preprint arXiv:2511.02483, 2025

    Xilong Zhou, Jianchun Chen, Pramod Rao, Timo Teufel, Linjie Lyu, Tigran Minasian, Oleksandr Sotnychenko, Xiao-Xiao Long, Marc Habermann, and Christian Theobalt. Olatverse: A large-scale real-world object dataset with precise lighting control. arXiv preprint arXiv:2511.02483, 2025

  62. [62]

    Learning-based inverse rendering of complex indoor scenes with differentiable monte carlo raytracing

    Jingsen Zhu, Fujun Luan, Yuchi Huo, Zihao Lin, Zhihua Zhong, Dianbing Xi, Rui Wang, Hujun Bao, Jiaxiang Zheng, and Rui Tang. Learning-based inverse rendering of complex indoor scenes with differentiable monte carlo raytracing. In SIGGRAPH Asia 2022 Conference Papers. ACM, 2022. URL https://doi.org/10.1145/3550469.3555407

  63. [63]

    Irisformer: Dense vision transformers for single-image inverse rendering in indoor scenes

    Rui Zhu, Zhengqin Li, Janarbek Matai, Fatih Porikli, and Manmohan Chandraker. Irisformer: Dense vision transformers for single-image inverse rendering in indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2822–2831, June 2022

  64. [64]

    Gobench: Benchmarking geometric optics generation and understanding of mllms

    Xiaorong Zhu, Ziheng Jia, Jiarui Wang, Xiangyu Zhao, Haodong Duan, Xiongkuo Min, Jia Wang, Zicheng Zhang, and Guangtao Zhai. Gobench: Benchmarking geometric optics generation and understanding of mllms. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12690–12697, 2025

    A Proofs and scores

    A.1 Proof of metric invariance

    We sh...
