pith. sign in

arxiv: 2605.18735 · v1 · pith:NEWRBEXGnew · submitted 2026-05-18 · 💻 cs.CV · cs.GR· cs.LG

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

Pith reviewed 2026-05-20 11:10 UTC · model grok-4.3

classification 💻 cs.CV cs.GRcs.LG
keywords relightingintrinsic decompositionneural renderingPBR lightingsingle-imagetransformerphysically based renderingfeed-forward
0
0 comments X p. Extension
pith:NEWRBEXG Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{NEWRBEXG}

Prints a linked pith:NEWRBEXG badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

A neural renderer transfers arbitrary PBR lighting to a single photograph by conditioning on intrinsic maps obtained identically from real multi-illumination images or path-traced coarse 3D renders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a feed-forward method that lets users control lighting on one photo using physically based rendering parameters. It trains on pairs of real photos taken under different lights by first breaking each image into albedo, diffuse shading, and non-diffuse residuals. At test time the same breakdown is computed from a quick path-traced render of a rough 3D model under the desired lights, then a transformer applies those new lighting components to the original photo while keeping fine detail through per-pixel affine modulation. If the approach works, single-image relighting becomes both physically controllable and fast enough for interactive use without per-image optimization or chained inverse-forward rendering errors.

Core claim

PIXLRelight bridges physically based rendering and learned image synthesis through a shared intrinsic conditioning signal. Training decomposes real multi-illumination photograph pairs into albedo, diffuse shading, and non-diffuse residuals. Inference computes the identical conditioning from a path-traced render of a coarse 3D reconstruction lit by user-specified PBR lights. A transformer-based neural renderer then modulates the source photograph with the target illumination via per-pixel affine modulation, preserving detail while achieving arbitrary lighting control.

What carries the argument

Shared intrinsic conditioning (albedo, diffuse shading, and non-diffuse residuals) extracted consistently from either real multi-illumination photographs or path-traced renders of coarse 3D reconstructions.

If this is right

  • Users gain arbitrary PBR-style lighting control from a single input photograph.
  • Relighting quality reaches state-of-the-art levels while eliminating error accumulation from chained inverse and forward rendering.
  • Inference completes in under one tenth of a second per image without per-image optimization.
  • The same conditioning pipeline works for both training data from real photographs and inference data from simulated renders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could support real-time relighting in AR or VR once a coarse 3D proxy is available from a single view.
  • Extending the decomposition to include specular or subsurface components would widen the range of materials that can be relit without retraining.
  • Combining the approach with better single-image 3D reconstruction would reduce reliance on coarse proxies and improve handling of complex geometry.

Load-bearing premise

The intrinsic decomposition obtained from a path-traced render of a coarse 3D reconstruction matches the decomposition from real multi-illumination photographs closely enough to support faithful lighting transfer.

What would settle it

Apply the trained model to a real photograph whose target lighting is taken from a different real capture of the same scene, then measure whether pixel-wise error and perceptual metrics remain comparable to methods that use ground-truth lighting maps.

Figures

Figures reproduced from arXiv: 2605.18735 by Miguel Farinha, Ronald Clark.

Figure 1
Figure 1. Figure 1: PIXLRELIGHT is a feed-forward transformer that produces photorealistic relightings of in-the-wild photographs in a single pass. The user authors the target illumination in a physically based renderer, with full control over light type, position, color, and intensity. The model is conditioned on the resulting intrinsic decomposition of the target appearance – albedo, diffuse shading, and a non-diffuse resid… view at source ↗
Figure 2
Figure 2. Figure 2: Training pipeline. The source image is patchified by a ViT branch and the channel-wise concatenated target intrinsics – extracted from the target image by a frozen Marigold-IID-Lighting model – are patchified by a ConvNeXt branch. The two token grids are fused per spatial location, projected to a common dimension, and processed by a self-attention transformer trunk. A DPT head reads out intermediate trunk … view at source ↗
Figure 3
Figure 3. Figure 3: Inference pipeline. Given a single input image, geometry is recovered by Depth Anything 3 and unprojected to a triangle mesh, and materials are recovered by Marigold-IID-Appearance. The textured mesh is loaded into Blender, where the user authors the desired illumination; Blender Cycles then renders the scene and produces the target intrinsic maps CT . PIXLRELIGHT takes as input the original image together… view at source ↗
Figure 4
Figure 4. Figure 4: Held-out tripod captures: paired source–target relighting. Three representative pairs from a held-out set of six indoor scenes captured under two lighting conditions each. For visual clarity we display the most recent baseline per conditioning group: UniRelight (environment-map) and V-RGBX (shading). Per-image PSNR/SSIM/LPIPS are shown below each prediction; bold marks the best method per image. PIXLRELIGH… view at source ↗
Figure 5
Figure 5. Figure 5: Relighting from path-traced illumination on DL3DV scenes [30]. Each row shows a source image, V-RGBX (the most recent shading-conditioned baseline), Blender’s full RGB render of the reconstructed scene under the authored lighting (Path Traced), and PIXLRELIGHT. V-RGBX produces plausible relightings but drifts from the source. PIXLRELIGHT transfers the authored lighting while preserving the source’s photogr… view at source ↗
Figure 6
Figure 6. Figure 6: Failure case from upstream reconstruction errors. A DL3DV scene where the single￾image depth estimator collapses the bicycle into the floor. The path-traced render carries this error into CT , and PIXLRELIGHT inherits it in the relit output. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Banner figure expansion. A single source image (left) is relit by PIXLRELIGHT (right) under six different Blender-authored illuminations. The middle column shows the corresponding path￾traced render of the reconstructed scene. PIXLRELIGHT consumes only the intrinsic buffers derived from these renders, together with the source image, and produces a sharper and more photorealistic relighting that retains the… view at source ↗
Figure 8
Figure 8. Figure 8: Additional qualitative comparisons on the MIIW test split [34]. Each row shows a single source–target pair, with predictions from the most recent baseline of each conditioning group (UniRelight for environment-map methods; V-RGBX for shading-conditioned methods, see Tab. 1) and PIXLRELIGHT. Across all twelve scenes, PIXLRELIGHT retains source detail by construction and transfers only the lighting change im… view at source ↗
Figure 9
Figure 9. Figure 9: Additional held-out tripod captures. Further source–target pairs from the held-out set, beyond the three shown in [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Additional relighting comparisons on DL3DV scenes [30]. Each row: a single source image, the most recent shading-conditioned baseline (V-RGBX), Blender’s full RGB render of the reconstructed scene under the authored lighting (Path Traced), and PIXLRELIGHT. PIXLRELIGHT transfers the authored lighting while preserving the source’s photographic detail. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Additional relighting comparisons on DL3DV scenes [30]. Each row: a single source image, the most recent shading-conditioned baseline (V-RGBX), Blender’s full RGB render of the reconstructed scene under the authored lighting (Path Traced), and PIXLRELIGHT. PIXLRELIGHT transfers the authored lighting while preserving the source’s photographic detail. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Additional ablation comparisons on the held-out tripod captures. Both variants are trained from scratch and differ from the full model in exactly one component. Intrinsics-only: the source ViT branch is removed; the source enters the network only through the modulation head. Direct regression: the modulation of eq. (4) is replaced by a sigmoid-activated RGB regression. Ours is the full model. 22 [PITH_FU… view at source ↗
read the original abstract

We present PIXLRelight, a feed-forward approach for physically controllable single-image relighting. Existing methods either provide limited lighting control (e.g. through text or environment maps), accumulate errors when chaining inverse and forward rendering, or require costly per-image optimization. Our key idea is to bridge physically based rendering (PBR) and learned image synthesis through a shared intrinsic conditioning that can be obtained from either real photographs or PBR renders. At training time, paired multi-illumination photographs are decomposed into albedo, diffuse shading, and non-diffuse residuals, which condition the model. At inference time, the same conditioning is computed from a path-traced render of a coarse 3D reconstruction of the input under user-specified PBR lights. A transformer-based neural renderer then applies the target illumination to the source photograph, preserving fine image detail through a per-pixel affine modulation. PIXLRelight enables arbitrary PBR-style lighting control, achieves state-of-the-art relighting quality, and runs in under a tenth of a second per image. Code and models are available at https://mlfarinha.github.io/pixl-relight/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents PIXLRelight, a feed-forward approach for physically controllable single-image relighting. It uses a shared intrinsic conditioning consisting of albedo, diffuse shading, and non-diffuse residuals obtained from either real multi-illumination photographs or PBR renders of coarse 3D reconstructions. A transformer-based neural renderer applies the target illumination to the source image via per-pixel affine modulation. The method claims to enable arbitrary PBR-style lighting control, achieve state-of-the-art quality, and run in under 0.1 seconds per image.

Significance. This work has the potential to advance controllable relighting techniques by avoiding the need for per-image optimization and error accumulation in chained rendering pipelines. By bridging PBR and learned synthesis through intrinsic conditioning, it could facilitate real-time applications if the domain compatibility is validated. The open-sourcing of code and models strengthens its contribution to the community.

major comments (2)
  1. [Section 3.2] The decomposition network is trained only on real data (as described in Section 3.2 and Figure 3), yet the same network is applied at inference to synthetic path-traced renders. This may result in out-of-distribution inputs due to approximations in geometry, materials, and shadows in the coarse 3D reconstruction, potentially leading to inaccurate relighting. The paper should include experiments validating the compatibility of decompositions from real and synthetic sources.
  2. [Abstract and Results] The abstract asserts state-of-the-art relighting quality and real-time performance, but the manuscript provides no quantitative metrics, baseline comparisons, or ablation studies to support these claims. Without such evidence, the central performance assertions cannot be fully evaluated.
minor comments (2)
  1. [Figure 3] Improve the clarity of the architecture diagram by adding more detailed labels for the transformer renderer and modulation steps.
  2. [Abstract] The claim of 'state-of-the-art relighting quality' should be qualified or supported with specific references to comparisons in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment in detail below, indicating planned changes to the manuscript.

read point-by-point responses
  1. Referee: [Section 3.2] The decomposition network is trained only on real data (as described in Section 3.2 and Figure 3), yet the same network is applied at inference to synthetic path-traced renders. This may result in out-of-distribution inputs due to approximations in geometry, materials, and shadows in the coarse 3D reconstruction, potentially leading to inaccurate relighting. The paper should include experiments validating the compatibility of decompositions from real and synthetic sources.

    Authors: We agree that the decomposition network is trained exclusively on real multi-illumination photographs, as stated in Section 3.2, and that applying it to synthetic path-traced renders of coarse 3D reconstructions introduces a potential domain gap due to differences in geometry, materials, and shadow approximations. Although the shared intrinsic conditioning (albedo, diffuse shading, and non-diffuse residuals) is intended to provide a domain-bridging representation, we acknowledge that explicit validation would strengthen the approach. We will add experiments to the revised manuscript that compare decomposition outputs on matched real and synthetic versions of the same scenes, including visual and quantitative consistency metrics where feasible. revision: yes

  2. Referee: [Abstract and Results] The abstract asserts state-of-the-art relighting quality and real-time performance, but the manuscript provides no quantitative metrics, baseline comparisons, or ablation studies to support these claims. Without such evidence, the central performance assertions cannot be fully evaluated.

    Authors: We thank the referee for highlighting this point. The abstract summarizes the relighting quality and runtime claims based on the qualitative visual results, comparisons to prior work, and timing benchmarks presented in the Results section. We recognize, however, that quantitative metrics (e.g., PSNR, SSIM on relighting benchmarks), direct numerical baseline comparisons, and ablation studies on components such as the transformer renderer and per-pixel modulation would provide stronger substantiation. We will incorporate these quantitative evaluations, baseline tables, and ablations into the revised Results section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents a feed-forward transformer renderer that applies target illumination to a source photograph using intrinsic conditioning (albedo, diffuse shading, non-diffuse residuals). Training uses decompositions from real paired multi-illumination photographs, while inference computes the same conditioning from path-traced renders of a coarse 3D reconstruction under user-specified PBR lights. No equations, derivations, or self-citations reduce the output to a fitted quantity defined from the input by construction. The method relies on external PBR rendering and a standard decomposition network without per-image optimization or domain adaptation, making the central claim independent of circular reductions. The distributional gap between real and synthetic decompositions is an assumption about generalization rather than a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the domain assumption that reliable intrinsic decompositions can be obtained from both real photographs and rendered images, plus the modeling choice that a coarse 3D reconstruction is adequate for generating target lighting conditioning. No explicit free parameters or new invented entities are named in the abstract.

axioms (2)
  • domain assumption Paired multi-illumination photographs can be decomposed into albedo, diffuse shading, and non-diffuse residuals with sufficient accuracy for training.
    This decomposition is used to create the conditioning signal at training time.
  • domain assumption A coarse 3D reconstruction of the input scene is sufficient to produce a path-traced render whose intrinsic components match the format needed for lighting transfer.
    This is the key step that enables inference-time control.

pith-pipeline@v0.9.0 · 5729 in / 1587 out tokens · 58162 ms · 2026-05-20T11:10:02.664335+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 5 internal anchors

  1. [1]

    Recovering intrinsic scene characteristics.Comput

    Harry Barrow, J Tenenbaum, A Hanson, and E Riseman. Recovering intrinsic scene characteristics.Comput. vis. syst, 2(3-26):2, 1978

  2. [2]

    Genlit: Reformulating single-image relighting as video generation

    Shrisha Bharadwaj, Haiwen Feng, Giorgio Becherini, Victoria Fernandez Abrevaya, and Michael J Black. Genlit: Reformulating single-image relighting as video generation. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–12, 2025

  3. [3]

    Blender Foundation

    Blender Online Community.Blender - a 3D modelling and rendering package. Blender Foundation

  4. [4]

    Intrinsic image decomposition via ordinal shading.ACM Transactions on Graphics, 43(1):1–24, 2023

    Chris Careaga and Ya˘gız Aksoy. Intrinsic image decomposition via ordinal shading.ACM Transactions on Graphics, 43(1):1–24, 2023

  5. [5]

    Physically controllable relighting of photographs

    Chris Careaga and Ya˘gız Aksoy. Physically controllable relighting of photographs. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–10, 2025

  6. [6]

    Diffusionlight-turbo: Accelerated light probes for free via single-pass chrome ball inpainting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

    Worameth Chinchuthakun, Pakkapon Phongthawee, Amit Raj, Varun Jampani, Pramook Khungurn, and Supasorn Suwajanakorn. Diffusionlight-turbo: Accelerated light probes for free via single-pass chrome ball inpainting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  7. [7]

    Scribblelight: Single image indoor relighting with scribbles

    Jun Myeong Choi, Annie Wang, Pieter Peers, Anand Bhattad, and Roni Sengupta. Scribblelight: Single image indoor relighting with scribbles. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5720–5731, 2025

  8. [8]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023

  9. [9]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  10. [10]

    Unreal engine

    Epic Games. Unreal engine

  11. [11]

    V-rgbx: Video editing with accurate controls over intrinsic properties.arXiv preprint arXiv:2512.11799, 2025

    Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao Paul Huang, Yiwei Hu, Xuelin Chen, and Tuanfeng Yang Wang. V-rgbx: Video editing with accurate controls over intrinsic properties.arXiv preprint arXiv:2512.11799, 2025

  12. [12]

    Ground truth dataset and baseline evaluations for intrinsic image algorithms

    Roger Grosse, Micah K Johnson, Edward H Adelson, and William T Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In2009 IEEE 12th International Conference on Computer Vision, pages 2335–2342. Ieee, 2009

  13. [13]

    Cambridge university press, 2003

    Richard Hartley and Andrew Zisserman.Multiple view geometry in computer vision. Cambridge university press, 2003

  14. [14]

    Unirelight: Learning joint decomposition and synthesis for video relighting.arXiv preprint arXiv:2506.15673, 2025

    Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar, Alexander Keller, Sanja Fidler, Igor Gilitschenski, Zan Gojcic, and Zian Wang. Unirelight: Learning joint decomposition and synthesis for video relighting.arXiv preprint arXiv:2506.15673, 2025

  15. [15]

    Vidit: Virtual image dataset for illumination transfer.arXiv preprint arXiv:2005.05460, 2020

    Majed El Helou, Ruofan Zhou, Johan Barthas, and Sabine Süsstrunk. Vidit: Virtual image dataset for illumination transfer.arXiv preprint arXiv:2005.05460, 2020

  16. [16]

    Rayzer: A self-supervised large view synthesis model

    Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, et al. Rayzer: A self-supervised large view synthesis model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4918–4929, 2025

  17. [17]

    Lvsm: A large view synthesis model with minimal 3d inductive bias.arXiv preprint arXiv:2410.17242, 2024

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias.arXiv preprint arXiv:2410.17242, 2024

  18. [18]

    Neural gaffer: Relighting any object via diffusion.Advances in Neural Information Processing Systems, 37:141129–141152, 2024

    Haian Jin, Yuan Li, Fujun Luan, Yuanbo Xiangli, Sai Bi, Kai Zhang, Zexiang Xu, Jin Sun, and Noah Snavely. Neural gaffer: Relighting any object via diffusion.Advances in Neural Information Processing Systems, 37:141129–141152, 2024

  19. [19]

    Perceptual losses for real-time style transfer and super-resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. InEuropean conference on computer vision, pages 694–711. Springer, 2016

  20. [20]

    Marigold: Affordable adaptation of diffusion-based image generators for image analysis, 2025

    Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, and Konrad Schindler. Marigold: Affordable adaptation of diffusion-based image generators for image analysis, 2025. 10

  21. [21]

    Intrinsic image diffusion for indoor single-view material estimation

    Peter Kocsis, Vincent Sitzmann, and Matthias Nießner. Intrinsic image diffusion for indoor single-view material estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5198–5208, 2024

  22. [22]

    Intrinsix: High-quality pbr generation using image priors.arXiv preprint arXiv:2504.01008, 2025

    Peter Kocsis, Lukas Höllein, and Matthias Nießner. Intrinsix: High-quality pbr generation using image priors.arXiv preprint arXiv:2504.01008, 2025

  23. [23]

    Lightness and retinex theory.Journal of the Optical society of America, 61(1):1–11, 1971

    Edwin H Land and John J McCann. Lightness and retinex theory.Journal of the Optical society of America, 61(1):1–11, 1971

  24. [24]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024

  25. [25]

    Learning intrinsic image decomposition from watching the world

    Zhengqi Li and Noah Snavely. Learning intrinsic image decomposition from watching the world. In Computer Vision and Pattern Recognition (CVPR), 2018

  26. [26]

    Inverse rendering for complex indoor scenes: Shape, spatially-varying lighting and svbrdf from a single image

    Zhengqin Li, Mohammad Shafiei, Ravi Ramamoorthi, Kalyan Sunkavalli, and Manmohan Chandraker. Inverse rendering for complex indoor scenes: Shape, spatially-varying lighting and svbrdf from a single image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2475–2484, 2020

  27. [27]

    Openrooms: An open framework for photorealistic indoor scene datasets

    Zhengqin Li, Ting-Wei Yu, Shen Sang, Sarah Wang, Meng Song, Yuhan Liu, Yu-Ying Yeh, Rui Zhu, Nitesh Gundavarapu, Jia Shi, et al. Openrooms: An open framework for photorealistic indoor scene datasets. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7190–7199, 2021

  28. [28]

    Diffusion renderer: Neural inverse and forward rendering with video diffusion models

    Ruofan Liang, Zan Gojcic, Huan Ling, Jacob Munkberg, Jon Hasselgren, Chih-Hao Lin, Jun Gao, Alexander Keller, Nandita Vijaykumar, Sanja Fidler, et al. Diffusion renderer: Neural inverse and forward rendering with video diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26069–26080, 2025

  29. [29]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  30. [30]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  31. [31]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022

  32. [32]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  33. [33]

    Lightlab: Controlling light sources in images with diffusion models

    Nadav Magar, Amir Hertz, Eric Tabellion, Yael Pritch, Alex Rav-Acha, Ariel Shamir, and Yedid Hoshen. Lightlab: Controlling light sources in images with diffusion models. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

  34. [34]

    A dataset of multi-illumination images in the wild

    Lukas Murmann, Michael Gharbi, Miika Aittala, and Fredo Durand. A dataset of multi-illumination images in the wild. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4080–4089, 2019

  35. [35]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021

  36. [36]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 10912–10922, 2021

  37. [37]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  38. [38]

    SyncLight: Single-Edit Multi-View Relighting

    David Serrano-Lozano, Anand Bhattad, Luis Herranz, Jean-François Lalonde, and Javier Vazquez-Corral. Synclight: Controllable and consistent multi-view relighting.arXiv preprint arXiv:2601.16981, 2026. 11

  39. [39]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  40. [40]

    Ouroboros: Single-step diffusion models for cycle-consistent forward and inverse rendering

    Shanlin Sun, Yifan Wang, Hanwen Zhang, Yifeng Xiong, Qin Ren, Ruogu Fang, Xiaohui Xie, and Chenyu You. Ouroboros: Single-step diffusion models for cycle-consistent forward and inverse rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10386–10397, 2025

  41. [41]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  42. [42]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  43. [43]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  44. [44]

    Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

  45. [45]

    Learning indoor inverse rendering with 3d spatially-varying lighting

    Zian Wang, Jonah Philion, Sanja Fidler, and Jan Kautz. Learning indoor inverse rendering with 3d spatially-varying lighting. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12538–12547, 2021

  46. [46]

    Convnext v2: Co-designing and scaling convnets with masked autoencoders

    Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16133–16142, 2023

  47. [47]

    Luminet: Latent intrinsics meets diffusion models for indoor scene relighting

    Xiaoyan Xing, Konrad Groh, Sezer Karaoglu, Theo Gevers, and Anand Bhattad. Luminet: Latent intrinsics meets diffusion models for indoor scene relighting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 442–452, 2025

  48. [48]

    Depth anything v2.Advances in Neural Information Processing Systems, 37:21875–21911, 2024

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2.Advances in Neural Information Processing Systems, 37:21875–21911, 2024

  49. [49]

    Dilightnet: Fine-grained lighting control for diffusion-based image generation

    Chong Zeng, Yue Dong, Pieter Peers, Youkang Kong, Hongzhi Wu, and Xin Tong. Dilightnet: Fine-grained lighting control for diffusion-based image generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024

  50. [50]

    Rgb↔x: Image decomposition and synthesis using material- and lighting-aware diffusion models

    Zheng Zeng, Valentin Deschaintre, Iliyan Georgiev, Yannick Hold-Geoffroy, Yiwei Hu, Fujun Luan, Ling- Qi Yan, and Miloš Hašan. Rgb↔x: Image decomposition and synthesis using material- and lighting-aware diffusion models. InACM SIGGRAPH 2024 Conference Papers, New York, NY , USA, 2024. Association for Computing Machinery

  51. [51]

    Scaling in-the-wild training for diffusion-based illumi- nation harmonization and editing by imposing consistent light transport

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Scaling in-the-wild training for diffusion-based illumi- nation harmonization and editing by imposing consistent light transport. InThe Thirteenth International Conference on Learning Representations, 2025

  52. [52]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

  53. [53]

    Irisformer: Dense vision transformers for single-image inverse rendering in indoor scenes

    Rui Zhu, Zhengqin Li, Janarbek Matai, Fatih Porikli, and Manmohan Chandraker. Irisformer: Dense vision transformers for single-image inverse rendering in indoor scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2822–2831, 2022. 12 Appendix This appendix collects implementation details and additional results...