pith. machine review for the scientific record.

arxiv: 2605.11696 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI · cs.GR

Recognition: 2 theorem links · Lean Theorem

WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:25 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR
keywords relighting · models · single-image · wildrelight · adaptation · dataset · synthetic · aligned

The pith

The WildRelight dataset lets synthetically trained relighting models adapt to real outdoor scenes using only temporal lighting changes as supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WildRelight as the first in-the-wild dataset of high-resolution outdoor scenes captured under strictly aligned sequences of natural illumination, each paired with an HDR environment map. It demonstrates that current single-image relighting models trained on synthetic data suffer large domain shifts when applied to these real scenes. The authors then show that the dataset's temporal alignment supplies a self-supervised signal that can be used at inference time to adapt models to real statistics. By combining diffusion posterior sampling with sampling-aware test-time adaptation, the framework updates the model on the fly without any ground-truth relit images. Readers should care because the work both quantifies the sim-to-real gap and supplies a concrete, physics-guided way to close it for practical relighting.

Core claim

WildRelight supplies strictly aligned temporal sequences of real outdoor scenes under varying natural illuminations together with HDR environment maps; this structure enables a physics-guided inference method that integrates Diffusion Posterior Sampling with temporal Sampling-Aware Test-Time Adaptation, allowing synthetic models to align with real-world lighting statistics on the fly and turning the sim-to-real problem into a tractable self-supervised task.

What carries the argument

Strictly aligned temporal structure of the WildRelight scenes, serving as a self-supervised constraint inside a DPS-plus-temporal-TTA inference framework.
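To make the machinery concrete, here is a minimal sketch of how strict temporal alignment can serve as a training signal. The relight interface, the L1 objective, and the random pairing are illustrative assumptions; the paper's exact loss is not reproduced here.

```python
import random
import torch
import torch.nn.functional as F

def temporal_self_supervision_step(model, frames, envmaps, optimizer):
    """One hypothetical adaptation step on an aligned WildRelight sequence.

    frames[i] and envmaps[i] are co-captured: same static scene, varying
    natural light. Because only illumination changes between captures,
    relighting frame i with envmap j should reproduce frame j, so no
    ground-truth relit image is needed.
    """
    i, j = random.sample(range(len(frames)), 2)
    pred = model.relight(frames[i], envmaps[j])  # assumed model interface
    loss = F.l1_loss(pred, frames[j])            # illustrative choice of loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```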

If this is right

  • State-of-the-art synthetic relighting models exhibit severe domain shifts on real-world data.
  • The temporal structure enables effective self-supervised adaptation without ground-truth relit images.
  • The same dataset functions as a rigorous benchmark for measuring progress in single-image relighting.
  • The DPS-TTA combination converts an intractable domain-shift problem into an on-the-fly self-supervised task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar temporal or multi-frame constraints could be exploited for domain adaptation in other single-image translation tasks such as dehazing or colorization.
  • Training future relighting networks with built-in test-time adaptation modules might reduce the need for separate inference-time tuning.
  • Releasing the dataset publicly will let researchers test whether the adaptation generalizes beyond the outdoor scenes used here.
  • Applying the same framework to indoor or synthetic-to-real indoor data would test whether the temporal self-supervision depends on natural outdoor light variation.

Load-bearing premise

The captured scenes maintain perfect temporal alignment that supplies a reliable self-supervised signal for domain adaptation.

What would settle it

A direct comparison showing that models adapted with the temporal TTA produce lighting that is inconsistent across the sequence, or that mismatches the supplied HDR environment maps, would falsify the adaptation claim.
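One concrete shape for such a test: relight every frame of an aligned sequence to a single target illumination and measure how much the outputs disagree. The relight interface and the plain pixel metric are assumptions, and a faithful probe would also mask the dataset's annotated dynamic regions.

```python
import itertools
import torch

@torch.no_grad()
def relit_sequence_disagreement(model, frames, target_envmap):
    """Hypothetical falsification probe. A static scene relit to one fixed
    illumination should look the same regardless of which source frame was
    used; large pairwise differences (outside masked dynamic regions)
    would undercut the adaptation claim."""
    preds = [model.relight(f, target_envmap) for f in frames]  # assumed API
    diffs = [(a - b).abs().mean() for a, b in itertools.combinations(preds, 2)]
    return torch.stack(diffs).mean().item()
```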

Figures

Figures reproduced from arXiv: 2605.11696 by Jeppe Revall Frisvad, Lezhong Wang, Mehmet Onurcan Kaya, Siavash Bigdeli.

Figure 1
Figure 1: Example image and illumination pairs from our WildRelight dataset. Despite this remarkable progress, a critical question remains unanswered: how well do these models perform outside the sanitized confines of synthetic data? The training and, more importantly, the quantitative evaluation of most inverse rendering models are predominantly conducted on synthetic datasets [14,17,38,41]. While invaluable for dev… view at source ↗
Figure 2
Figure 2: We selected some example scenes from the dataset. We meticulously curated the dataset to represent a diverse range of challenging scenarios. The collected scenes encompass complex environmental conditions including tree structures, transparent glass surfaces, and reflective glass materials. For each scene, temporal variations are captured through photographs taken at different time intervals, accompanied b… view at source ↗
Figure 3
Figure 3: (a) Capture settings. Both the Insta360 Pro2 and Sony A7M2 are mounted on a rail system to enable front-to-back adjustments for precise alignment with the nodal point. (b) Photos of an Xrite ColorChecker are used to calibrate color between the main camera (Sony A7) and the envmap camera (Insta360 Pro2). Panel labels: before applying color calibration / after applying color calibration. view at source ↗
Figure 4
Figure 4: Comparison examples of color calibration effects. Due to the inherently sophisticated color science in professional-grade products such as the Sony A7 and Insta360 Pro2, visual differences between pre- and post-color-calibration images are negligible. To quantitatively analyze the calibration effects, a histogram was incorporated into the lower left corner of the image. While visual inspection reveals m… view at source ↗
Figure 5
Figure 5: Example of Dynamic Scene Elements. As shown in the figure, the left and right images were captured at different times with a fixed camera position. Despite the static camera setup, subtle movements of the leaves occur due to external factors such as wind. To address this, we manually created masks for these dynamic regions, allowing researchers to determine whether to include them when computing metric… view at source ↗
Figure 6
Figure 6: Qualitative Comparison of Different Methods on the Test Set Relighting. The numerical values displayed beneath the image correspond to the following metrics: SSIM, PSNR, and LPIPS. In zero-shot scenarios, both DiffusionRenderer [17] and RGB↔X [38] struggle to accurately render image brightness. After finetuning, DiffusionRenderer demonstrates improved alignment with GT in the test set, achieving more accur… view at source ↗
Figure 7
Figure 7: Qualitative Ablation Study Results. We visualize the results of our proposed framework on the WildRelight dataset. The envmap at the bottom left of each GT image corresponds to the target illumination used for relighting. Protocol: for N lightings, one is the test target while the other N − 1 serve as self-supervised signals. This rigorously simulates real-world deployment without ground-truth supervision. Ablation Stu… view at source ↗
Figure 8
Figure 8: Distribution of capture time differences (∆t). The histogram shows the frequency of absolute time delays between the scene image and the environment map capture across the dataset. The distribution is heavily right-skewed, with the vast majority of samples having a delay of less than 20 seconds, confirming that large delays are rare. 1.3 Impact on Relighting Tasks: Modern single-image relighting algorithms … view at source ↗
Figure 9
Figure 9: Advantage of RAW format with HDR. When using an HDR photo, the details of the photo are preserved. Therefore, by adjusting the exposure settings, the original colors of the photo can be accurately restored. The data recorded by a camera sensor in RAW format exhibits a fundamentally linear relationship with the light intensity of the actual scene. This linearity is a core advantage in computational photography… view at source ↗
Figure 10
Figure 10: Side-by-side camera setup; a non-aligned envmap camera will record direct sunlight, but the scene camera records a shadow. 3. Optical Artifacts: 360° cameras rely on heavy distortion correction and stitching algorithms, which introduce resampling artifacts. Using a dedicated rectilinear lens ensures the benchmark data is free from such algorithmic interference. 5.2 Necessity of Strict Spatial Alignment: Preci… view at source ↗
Figure 11
Figure 11: Nodal point alignment. When the camera is not positioned at the nodal point, rotating the camera causes the nearby utility pole to fail to occlude the poles behind it. In contrast, when the camera is at the nodal point, the nearby utility pole can occlude the poles behind it. The empirical determination of the entrance pupil's location, or the no-parallax point, is a foundational procedure in panoramic ph… view at source ↗
Figure 12
Figure 12: More examples from our dataset. 6.3 Details of Dynamic Scene Elements Annotation: A significant challenge in capturing longitudinal, "in-the-wild" datasets is the presence of dynamic scene elements. While our capture rig ensures a static viewpoint, the long temporal intervals between acquisitions mean that elements such as wind-blown foliage, grass, and cloud formations inevitably move. Although computation… view at source ↗
Figure 13
Figure 13: UI Interface for Marking Dynamic Scene Elements. Volunteers utilize a brush tool to create masks or an eraser tool to remove incorrectly marked regions. In the illustrated example, the areas indicated by red arrows correspond to discrepancies between the two photographs, highlighting regions where inconsistencies exist. Our annotation pipeline is as follows: 1. Pairwise Comparison: For each scene, annotat… view at source ↗
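The protocol quoted under Figure 7 (one of N lightings is the test target, the remaining N − 1 are the self-supervised signal) reduces to a leave-one-out split over a scene's captures; a minimal sketch, assuming each capture is an (image, envmap) pair:

```python
def leave_one_out_splits(captures):
    """Yield (support, target) splits per the Figure 7 protocol: for N
    lightings of a scene, each capture in turn is held out as the test
    target while the remaining N - 1 drive self-supervised adaptation.
    `captures` is assumed to be a list of (image, envmap) pairs."""
    for k in range(len(captures)):
        support = captures[:k] + captures[k + 1:]
        yield support, captures[k]
```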
read the original abstract

Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WildRelight, the first in-the-wild dataset for single-image relighting consisting of high-resolution outdoor scenes captured under strictly aligned, temporally varying natural illuminations, each paired with an HDR environment map. It establishes a benchmark showing severe domain shifts in synthetic-trained SOTA models and proposes a physics-guided inference framework that integrates Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA) to enable on-the-fly self-supervised adaptation of synthetic models to real-world statistics, using the temporal structure as a constraint.

Significance. If the adaptation framework holds, the work supplies both a needed real-world benchmark for single-image relighting and a self-supervised adaptation paradigm that could convert the sim-to-real gap into a tractable task without requiring ground-truth relit images, potentially improving generalization of generative relighting methods to outdoor scenes.

major comments (3)
  1. Abstract and §4 (method description): the central claim that DPS+TTA enables synthetic models to 'align with real-world statistics on-the-fly' and transforms the sim-to-real challenge into a 'tractable self-supervised task' is unsupported by any quantitative metrics, baseline comparisons, ablation studies, or implementation details in the provided text, rendering the claim unverifiable.
  2. Abstract and §3 (dataset): the adaptation method rests on the assumption that 'strictly aligned temporal structure' supplies a reliable self-supervised constraint with differences due solely to illumination; no analysis, statistics, or validation is supplied to confirm absence of non-illumination dynamics (foliage motion, cloud shadows, specular changes, or sub-pixel drift) that would corrupt the DPS posterior sampling and TTA objective.
  3. §5 (experiments): without reported numbers on adaptation performance (e.g., PSNR, LPIPS, or perceptual metrics before/after TTA on held-out real frames), it is impossible to assess whether the physics-guided loss actually drives alignment or merely fits to capture artifacts.
minor comments (2)
  1. Abstract: the phrase 'the dataset allows synthetic models to align...' should be rephrased to clarify that the alignment is demonstrated via the proposed method rather than being an intrinsic property of the data alone.
  2. Notation: the distinction between 'Sampling-Aware Test-Time Adaptation (TTA)' and standard TTA is introduced without an equation or pseudocode block; a short algorithmic outline would improve clarity.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We appreciate the acknowledgment of WildRelight's potential as the first in-the-wild benchmark for single-image relighting and the value of the proposed adaptation paradigm. We address each major comment point by point below and will revise the manuscript accordingly to strengthen the presentation and verifiability of our claims.

read point-by-point responses
  1. Referee: Abstract and §4 (method description): the central claim that DPS+TTA enables synthetic models to 'align with real-world statistics on-the-fly' and transforms the sim-to-real challenge into a 'tractable self-supervised task' is unsupported by any quantitative metrics, baseline comparisons, ablation studies, or implementation details in the provided text, rendering the claim unverifiable.

    Authors: We agree that the current version relies on qualitative demonstrations to support the adaptation claims. To address this, the revised manuscript will expand §4 with full implementation details of the DPS+TTA integration, including hyperparameters and sampling procedures. We will also add quantitative results, baseline comparisons, and ablation studies to §5, providing metrics that directly verify the on-the-fly alignment with real-world statistics. revision: yes

  2. Referee: Abstract and §3 (dataset): the adaptation method rests on the assumption that 'strictly aligned temporal structure' supplies a reliable self-supervised constraint with differences due solely to illumination; no analysis, statistics, or validation is supplied to confirm absence of non-illumination dynamics (foliage motion, cloud shadows, specular changes, or sub-pixel drift) that would corrupt the DPS posterior sampling and TTA objective.

    Authors: The capture protocol was designed with fixed-camera, short-interval sequences to isolate illumination as the dominant variable. We acknowledge the need for explicit validation. In the revision, §3 will include a new analysis subsection with alignment statistics (e.g., sub-pixel registration error), optical-flow-based quantification of non-illumination motion, and examples demonstrating that such factors remain negligible relative to lighting changes, thereby supporting the self-supervised constraint. revision: yes

  3. Referee: §5 (experiments): without reported numbers on adaptation performance (e.g., PSNR, LPIPS, or perceptual metrics before/after TTA on held-out real frames), it is impossible to assess whether the physics-guided loss actually drives alignment or merely fits to capture artifacts.

    Authors: We concur that numerical before/after evaluation is required to substantiate the adaptation efficacy. The current experiments emphasize the benchmark and qualitative results. The revised §5 will report PSNR, LPIPS, and additional perceptual metrics on held-out real frames, comparing performance prior to and after TTA, to demonstrate that the physics-guided objective improves alignment beyond artifact fitting. revision: yes
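The optical-flow quantification proposed in response 2 is straightforward to prototype. A minimal sketch using OpenCV's Farnebäck flow; the preprocessing and choice of statistic are assumptions, not the authors' protocol, and moving shadows can themselves register as apparent flow, so the number is best read as an upper bound on geometric motion.

```python
import cv2
import numpy as np

def motion_magnitude(frame_a, frame_b):
    """Median per-pixel displacement between two aligned captures.
    Near-zero values support the claim that frame differences are
    dominated by illumination rather than scene motion."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    displacement = np.linalg.norm(flow, axis=2)  # per-pixel motion in pixels
    return float(np.median(displacement))
```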
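For the before/after numbers promised in response 3, the supplement's own global scale alignment (extract [48] below: α* = argmin_α ‖I_pred · α − I_gt‖²) has a closed form, so the evaluation loop is small. A sketch with scikit-image; the clipping and data range are assumptions, and LPIPS is omitted because it requires a learned network.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def scale_aligned_metrics(pred, gt):
    """Apply the supplement's least-squares scale alignment, then compute
    PSNR and SSIM. Minimizing ||pred * a - gt||^2 over a scalar a gives
    the closed form a* = <pred, gt> / <pred, pred>."""
    alpha = float(np.sum(pred * gt) / np.sum(pred * pred))
    aligned = np.clip(pred * alpha, 0.0, 1.0)   # assumes images in [0, 1]
    return {
        "psnr": peak_signal_noise_ratio(gt, aligned, data_range=1.0),
        "ssim": structural_similarity(gt, aligned, channel_axis=-1,
                                      data_range=1.0),
    }
```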

Circularity Check

0 steps flagged

No significant circularity; new dataset with established DPS+TTA adaptation

full rationale

The paper introduces WildRelight as an independent data collection with strictly aligned temporal frames and HDR maps. It then applies the pre-existing Diffusion Posterior Sampling (DPS) and Sampling-Aware Test-Time Adaptation (TTA) frameworks to leverage that temporal structure as a self-supervised signal. No equations are shown that define a quantity in terms of itself, rename a fitted parameter as a prediction, or reduce the alignment result to a self-citation chain. The temporal stationarity assumption is an external modeling choice about the data, not a definitional loop. The derivation therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the dataset's temporal alignment property and the assumption that natural light evolution supplies a usable self-supervised signal for adaptation.

axioms (1)
  • domain assumption: Strictly aligned temporal captures under varying natural illuminations provide a valid self-supervised constraint for domain adaptation.
    Invoked to justify the physics-guided inference framework.

pith-pipeline@v0.9.0 · 5571 in / 1156 out tokens · 97547 ms · 2026-05-13T06:25:56.255915+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

  1. [1]

    Aksoy, Y., Kim, C., Kellnhofer, P., Paris, S., Elgharib, M., Pollefeys, M., Matusik, W.: A dataset of flash and ambient illumination pairs from the crowd. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 634–649 (2018)

  2. [2]

    Barron, J.T., Malik, J.: Shape, illumination, and reflectance from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(8), 1670–1687 (2014)

  3. [3]

    Boss, M., Braun, R., Jampani, V., Barron, J.T., Liu, C., Lensch, H.: NeRD: Neural reflectance decomposition from image collections. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 12684–12694 (2021)

  4. [4]

    Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C.: Diffusion posterior sampling for general noisy inverse problems. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=OnD9zGAGT0k

  5. [5]

    Cook, R.L., Torrance, K.E.: A reflectance model for computer graphics. ACM Transactions on Graphics (ToG) 1(1), 7–24 (1982)

  6. [6]

    Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. pp. 369–378. SIGGRAPH '97, ACM Press/Addison-Wesley Publishing Co., USA (1997). https://doi.org/10.1145/258734.258884

  7. [7]

    Debevec, P.E., Taylor, C.J., Malik, J.: Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. In: Proceedings of SIGGRAPH '96. pp. 11–20 (1996)

  8. [8]

    Haque, A., Tancik, M., Efros, A.A., Holynski, A., Kanazawa, A.: Instruct-NeRF2NeRF: Editing 3D scenes with instructions. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 19740–19750 (2023)

  9. [9]

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

  10. [10]

    Jakob, W., Speierer, S., Roussel, N., Nimier-David, M., Vicini, D., Zeltner, T., Nicolet, B., Crespo, M., Leroy, V., Zhang, Z.: Mitsuba 3 renderer (2022), https://mitsuba-renderer.org

  11. [11]

    Jensen, R., Dahl, A., Vogiatzis, G., Tola, E., Aanæs, H.: Large scale multi-view stereopsis evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 406–413 (2014)

  12. [12]

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4), 139 (July 2023), https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  13. [13]

    Kuang, Z., Zhang, Y., Yu, H.X., Agarwala, S., Wu, E., Wu, J., et al.: Stanford-ORB: a real-world 3D object inverse rendering benchmark. Advances in Neural Information Processing Systems 36, 46938–46957 (2023)

  14. [14]

    Li, Z., Wang, L., Huang, X., Pan, C., Yang, J.: PhyIR: Physics-based inverse rendering for panoramic indoor images. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 12713–12723 (2022)

  15. [15]

    Li, Z., Shi, J., Bi, S., Zhu, R., Sunkavalli, K., Hašan, M., Xu, Z., Ramamoorthi, R., Chandraker, M.: Physically-based editing of indoor scene lighting from a single image. In: European Conference on Computer Vision (ECCV). pp. 555–572. Springer (2022)

  16. [16]

    Li, Z., Yu, T.W., Sang, S., Wang, S., Song, M., Liu, Y., Yeh, Y.Y., Zhu, R., Gundavarapu, N., Shi, J., Bi, S., Yu, H.X., Xu, Z., Sunkavalli, K., Hasan, M., Ramamoorthi, R., Chandraker, M.: OpenRooms: An open framework for photorealistic indoor scene datasets. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 7190–…

  17. [17]

    Liang, R., Gojcic, Z., Ling, H., Munkberg, J., Hasselgren, J., Lin, C.H., Gao, J., Keller, A., Vijaykumar, N., Fidler, S., et al.: Diffusion renderer: Neural inverse and forward rendering with video diffusion models. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 26069–26080 (2025)

  18. [18]

    Liu, I., Chen, L., Fu, Z., Wu, L., Jin, H., Li, Z., Wong, C.M.R., Xu, Y., Ramamoorthi, R., Xu, Z., et al.: OpenIllumination: A multi-illumination dataset for inverse rendering evaluation on real objects. Advances in Neural Information Processing Systems 36, 36951–36962 (2023)

  19. [19]

    Liu, Y., Wang, P., Lin, C., Long, X., Wang, J., Liu, L., Komura, T., Wang, W.: NeRO: Neural geometry and BRDF reconstruction of reflective objects from multiview images. ACM Transactions on Graphics 42(4), 114 (2023)

  20. [20]

    Lombardi, S., Nishino, K.: Reflectance and illumination recovery in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(1), 129–141 (2015)

  21. [21]

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Proceedings of International Conference on Learning Representations (ICLR) (2019)

  22. [22]

    Luo, J., Ceylan, D., Yoon, J.S., Zhao, N., Philip, J., Frühstück, A., Li, W., Richardt, C., Wang, T.: IntrinsicDiffusion: Joint intrinsic layers from latent diffusion models. In: ACM SIGGRAPH 2024 Conference Papers. pp. 74:1–74:11 (2024)

  23. [23]

    Matusik, W.: A data-driven reflectance model. Ph.D. thesis, Massachusetts Institute of Technology (2003)

  24. [24]

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)

  25. [25]

    Murmann, L., Gharbi, M., Aittala, M., Durand, F.: A dataset of multi-illumination images in the wild. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 4080–4089 (2019)

  26. [26]

    Pei, F., Bai, J., Feng, X., Bi, Z., Zhou, K., Wu, H.: OpenSubstance: A high-quality measured dataset of multi-view and -lighting images and shapes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5221–5231 (2025)

  27. [27]

    Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic tone reproduction for digital images. ACM Transactions on Graphics 21(3), 267–276 (2002)

  28. [28]

    Rudnev, V., Elgharib, M., Smith, W., Liu, L., Golyanik, V., Theobalt, C.: NeRF for outdoor scene relighting. In: European Conference on Computer Vision. pp. 615–631. Springer (2022)

  29. [29]

    Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Proceedings. vol. 1, pp. I–I. IEEE (2003)

  30. [30]

    Sengupta, S., Gu, J., Kim, K., Liu, G., Jacobs, D.W., Kautz, J.: Neural inverse rendering of an indoor scene from a single image. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 8598–8607. IEEE (2019)

  31. [31]

    Shi, B., Wu, Z., Mo, Z., Duan, D., Yeung, S.K., Tan, P.: A benchmark dataset and evaluation for non-Lambertian and uncalibrated photometric stereo. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 3707–3716 (2016)

  32. [32]

    Teufel, T., Gera, P., Zhou, X., Iqbal, U., Rao, P., Kautz, J., Golyanik, V., Theobalt, C.: HumanOLAT: A large-scale dataset for full-body human relighting and novel-view synthesis. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 29131–29141 (2025)

  33. [33]

    Toschi, M., De Matteo, R., Spezialetti, R., De Gregorio, D., Di Stefano, L., Salti, S.: Relight my NeRF: A dataset for novel view synthesis and relighting of real world objects. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 20762–20772 (2023)

  34. [34]

    Ummenhofer, B., Agrawal, S., Sepúlveda, R., Lao, Y., Zhang, K., Cheng, T., Richter, S.R., Wang, S., Ros, G.: Objects with lighting: A real-world dataset for evaluating reconstruction and rendering for object relighting. In: 3DV. IEEE (2024)

  35. [35]

    Walter, B., Marschner, S.R., Li, H., Torrance, K.E.: Microfacet models for refraction through rough surfaces. In: Rendering Techniques (Proceedings of the 18th Eurographics Symposium on Rendering) (2007)

  36. [36]

    Wang, L., Tran, D.M., Cui, R., TG, T., Dahl, A.B., Bigdeli, S.A., Frisvad, J.R., Chandraker, M.: Materialist: Physically based editing using single-image inverse rendering. International Journal of Computer Vision 134(6), 267 (2026). https://doi.org/10.1007/s11263-026-02833-z

  37. [37]

    Yi, R., Zhu, C., Xu, K.: Weakly-supervised single-view image relighting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8402–8411 (2023)

  38. [38]

    Zeng, Z., Deschaintre, V., Georgiev, I., Hold-Geoffroy, Y., Hu, Y., Luan, F., Yan, L.Q., Hašan, M.: RGB↔X: Image decomposition and synthesis using material- and lighting-aware diffusion models. In: ACM SIGGRAPH 2024 Conference Papers. pp. 75:1–75:11. ACM (2024). https://doi.org/10.1145/3641519.3657445

  39. [39]

    Zhang, K., Luan, F., Wang, Q., Bala, K., Snavely, N.: PhySG: Inverse rendering with spherical Gaussians for physics-based material editing and relighting. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 5453–5462 (2021)

  40. [40]

    Zhang, X., Tseng, N., Syed, A., Bhasin, R., Jaipuria, N.: SIMBAR: Single image-based scene relighting for effective data augmentation for automated driving vision tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3718–3728 (2022)

  41. [41]

    Zhang, X., Srinivasan, P.P., Deng, B., Debevec, P., Freeman, W.T., Barron, J.T.: NeRFactor: Neural factorization of shape and reflectance under an unknown illumination. ACM Transactions on Graphics 40(6), 237 (2021)

  42. [42]

    Zhang, Y., Sun, J., He, X., Fu, H., Jia, R., Zhou, X.: Modeling indirect illumination for inverse rendering. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 18643–18652 (2022)

  43. [43]

    Zhao, X., Srinivasan, P., Verbin, D., Park, K., Martin Brualla, R., Henzler, P.: IllumiNeRF: 3D relighting without inverse rendering. Advances in Neural Information Processing Systems 37, 42593–42617 (2024)

  44. [44]

    Zhou, X., Chen, J., Rao, P., Teufel, T., Lyu, L., Minasian, T., Sotnychenko, O., Long, X., Habermann, M., Theobalt, C.: OLATverse: A large-scale real-world object dataset with precise lighting control. arXiv preprint arXiv:2511.02483 (2025)

  45. [45]

    Zhu, R., Li, Z., Matai, J., Porikli, F., Chandraker, M.: IRISformer: Dense vision transformers for single-image inverse rendering in indoor scenes. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 2822–2831. IEEE (2022)

  46. [46]

    Supplementary Materials, §1 Quantitative Validation of Illumination Alignment: "We provide a rigorous quantitative validation based on metadata timestamp statistics and solar angular displacement analysis. This proves that the temporal gap in our acquisition pipeline results in physically negligible illumination misalignment for the task of relighting."

  47. [47]

    Supplementary Materials, §3 Details Setting of Baseline Benchmark: "…5-scene hold-out test set. We evaluate performance using three standard image quality metrics: Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index (SSIM), and the Learned Perceptual Image Patch Similarity (LPIPS). Baseline Models: We selected three representative methods that support single-image relig…"

  48. [48]

    Supplementary Materials, §5 Advantages of RAW-based HDR Image: "…reported experiments (including baselines, finetuning, and our method), we adopt a global least-squares alignment strategy. For every predicted image I_pred and its corresponding ground truth I_gt, we solve for an optimal scalar α: α* = argmin_α ‖I_pred · α − I_gt‖². (7) Metrics are computed on the aligned prediction I_pred · α*."

  49. [49]

    "Shadow Detail and Color Fidelity: In low-light environments, the RAW format, with its high bit depth (typically 12 or 14 bits), captures extensive detail in the dark regions. By increasing the exposure in post-processing, the original information can be recovered with minimal loss. Conversely, since this information is already discarded d…"

  50. [50]

    "Highlight Information Retention: In highlight regions, while both a RAW-based HDR image and a JPG image may appear as pure white on a Standard Dynamic Range (SDR) display due to exceeding the display's maximum brightness, the amount of information they contain is fundamentally different. The RAW data fully retains the color and tonal information within these bright areas."

  51. [51]

    "Effective Resolution: Even with an 8K 360° capture, projecting the image to a standard 40mm field-of-view (FOV) yields an effective resolution significantly lower than that of the 24MP+ full-frame Sony A7 used in our rig. This loss of high-frequency detail would severely compromise the evaluation of texture preservation and generation."

  52. [52]

    "Image Quality & Dynamic Range: Panoramic cameras typically utilize smaller sensors that introduce noise and chromatic aberration. More critically, they lack the dynamic range of the Sony A7's 14-bit RAW optical path. High dynamic range is essential for outdoor relighting tasks to accurately recover information in deep shadows and bright highlights."

  53. [53]

    Supplementary Materials, Methodology for Determining the Nodal Point (No-Parallax Point), Fig. 10: "Side-by-side camera setup; a non-aligned envmap camera will record direct sunlight, but the scene camera records a shadow."

  54. [54]

    "Optical Artifacts: 360° cameras rely on heavy distortion correction and stitching algorithms, which introduce resampling artifacts. Using a dedicated rectilinear lens ensures the benchmark data is free from such algorithmic interference. 5.2 Necessity of Strict Spatial Alignment: Precise co-location of the environment map camera (Insta360) and the scene ca…"

  55. [55]

    Supplementary Materials, Methodology for Determining the Nodal Point (No-Parallax Point): "…the entrance pupil. Adjustments must then be made to the fore-aft position of the camera on the panoramic head, and the rotational test is repeated. The objective is to achieve a state where, upon panning the camera to the left and right, the two reference objects remain in perfect alignment, with…"

  56. [56]

    "Pairwise Comparison: For each scene, annotators performed a sequential, pairwise comparison of adjacent time steps (e.g., t0 vs. t1, t1 vs. t2, etc.)"

  57. [57]

    "Difference Visualization: To aid the human annotators, we generated absolute pixel-difference images for each pair. This visualization technique effectively accentuates the contours of misaligned objects, where pixel gradients are highest, making the boundaries of dynamic elements more conspicuous."

  58. [58]

    "Manual Annotation: Annotators manually painted masks over all identified dynamic regions for each image pair. The primary targets for masking were clouds and moving vegetation (leaves, branches, and grass)."

  59. [59]

    "Mask Aggregation: The final mask for the entire scene is generated by computing the union of all pairwise masks. This ensures that any element that moved at any point during the capture sequence is included in the aggregate mask. We explicitly excluded two categories of dynamic effects from masking. First, water surfaces (e.g., lakes and seas) were not annotated du…"

  60. [60]

    Supplementary Materials, §7 Differentiable Cook–Torrance Renderer: "To evaluate the physics consistency of predicted G-buffers, we employ a fully differentiable Cook–Torrance microfacet model with split-sum approximation [5]. Let the per-pixel surface properties be defined by base color c_b, normal n, roughness α, and metallicity m, and let the environment illumination be given as an HDR map L_env(ω_i)…"
    DIFFERENTIABLE COOK–TORRANCE RENDERER 11 7 Differentiable Cook–Torrance Renderer To evaluate the physics consistency of predicted G-buffers, we employ a fully differentiable Cook–Torrance microfacet model with split-sum approximation [5]. Let the per-pixel surface properties be defined by basecolorcb, normaln, rough- ness α, and metallicitym, and let the ...