pith. machine review for the scientific record.

arxiv: 2605.11696 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI · cs.GR

Recognition: 2 theorem links · Lean Theorem

WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:25 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR
keywords relighting · models · single-image · wildrelight · adaptation · dataset · synthetic · aligned

The pith

The WildRelight dataset lets synthetically trained relighting models adapt to real outdoor scenes using only temporal lighting changes as supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WildRelight as the first in-the-wild dataset of high-resolution outdoor scenes captured under strictly aligned sequences of natural illumination, each paired with an HDR environment map. It demonstrates that current single-image relighting models trained on synthetic data suffer large domain shifts when applied to these real scenes. The authors then show that the dataset's temporal alignment supplies a self-supervised signal that can be used at inference time to adapt models to real statistics. By combining diffusion posterior sampling with sampling-aware test-time adaptation, the framework updates the model on the fly without any ground-truth relit images. Readers should care because the work both quantifies the sim-to-real gap and supplies a concrete, physics-guided way to close it for practical relighting.

Core claim

WildRelight supplies strictly aligned temporal sequences of real outdoor scenes under varying natural illuminations together with HDR environment maps; this structure enables a physics-guided inference method that integrates Diffusion Posterior Sampling with temporal Sampling-Aware Test-Time Adaptation, allowing synthetic models to align with real-world lighting statistics on the fly and turning the sim-to-real problem into a tractable self-supervised task.

What carries the argument

Strictly aligned temporal structure of the WildRelight scenes, serving as a self-supervised constraint inside a DPS-plus-temporal-TTA inference framework.
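To make the machinery concrete, here is a minimal sketch of how strict temporal alignment can serve as a training signal. The relight interface, the L1 objective, and the random pairing are illustrative assumptions; the paper's exact loss is not reproduced here.

```python
import random
import torch
import torch.nn.functional as F

def temporal_self_supervision_step(model, frames, envmaps, optimizer):
    """One hypothetical adaptation step on an aligned WildRelight sequence.

    frames[i] and envmaps[i] are co-captured: same static scene, varying
    natural light. Because only illumination changes between captures,
    relighting frame i with envmap j should reproduce frame j, so no
    ground-truth relit image is needed.
    """
    i, j = random.sample(range(len(frames)), 2)
    pred = model.relight(frames[i], envmaps[j])  # assumed model interface
    loss = F.l1_loss(pred, frames[j])            # illustrative choice of loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```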

If this is right

  • State-of-the-art synthetic relighting models exhibit severe domain shifts on real-world data.
  • The temporal structure enables effective self-supervised adaptation without ground-truth relit images.
  • The same dataset functions as a rigorous benchmark for measuring progress in single-image relighting.
  • The DPS-TTA combination converts an intractable domain-shift problem into an on-the-fly self-supervised task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar temporal or multi-frame constraints could be exploited for domain adaptation in other single-image translation tasks such as dehazing or colorization.
  • Training future relighting networks with built-in test-time adaptation modules might reduce the need for separate inference-time tuning.
  • Releasing the dataset publicly will let researchers test whether the adaptation generalizes beyond the outdoor scenes used here.
  • Applying the same framework to indoor or synthetic-to-real indoor data would test whether the temporal self-supervision depends on natural outdoor light variation.

Load-bearing premise

The captured scenes maintain perfect temporal alignment that supplies a reliable self-supervised signal for domain adaptation.

What would settle it

A direct comparison showing that models adapted with the temporal TTA produce lighting that is inconsistent across the sequence, or that mismatches the supplied HDR environment maps, would falsify the adaptation claim.
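One concrete shape for such a test: relight every frame of an aligned sequence to a single target illumination and measure how much the outputs disagree. The relight interface and the plain pixel metric are assumptions, and a faithful probe would also mask the dataset's annotated dynamic regions.

```python
import itertools
import torch

@torch.no_grad()
def relit_sequence_disagreement(model, frames, target_envmap):
    """Hypothetical falsification probe. A static scene relit to one fixed
    illumination should look the same regardless of which source frame was
    used; large pairwise differences (outside masked dynamic regions)
    would undercut the adaptation claim."""
    preds = [model.relight(f, target_envmap) for f in frames]  # assumed API
    diffs = [(a - b).abs().mean() for a, b in itertools.combinations(preds, 2)]
    return torch.stack(diffs).mean().item()
```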

Figures

Figures reproduced from arXiv: 2605.11696 by Jeppe Revall Frisvad, Lezhong Wang, Mehmet Onurcan Kaya, Siavash Bigdeli.

Figure 1
Figure 1: Example image and illumination pairs from our WildRelight dataset. Despite this remarkable progress, a critical question remains unanswered: how well do these models perform outside the sanitized confines of synthetic data? The training and, more importantly, the quantitative evaluation of most inverse rendering models are predominantly conducted on synthetic datasets [14,17,38,41]. While invaluable for dev… view at source ↗
Figure 2
Figure 2: We selected some example scenes from the dataset. We meticulously curated the dataset to represent a diverse range of challenging scenarios. The collected scenes encompass complex environmental conditions including tree structures, transparent glass surfaces, and reflective glass materials. For each scene, temporal variations are captured through photographs taken at different time intervals, accompanied b… view at source ↗
Figure 3
Figure 3: (a) Capture settings. Both the Insta360 Pro2 and Sony A7M2 are mounted on a rail system to enable front-to-back adjustments for precise alignment with the nodal point. (b) Photos of an Xrite ColorChecker are used to calibrate color between the main camera (Sony A7) and the envmap camera (Insta360 Pro2). Panel labels: before applying color calibration / after applying color calibration. view at source ↗
Figure 4
Figure 4: Comparison examples of color calibration effects. Due to the inherently sophisticated color science in professional-grade products such as the Sony A7 and Insta360 Pro2, visual differences between pre- and post-color-calibration images are negligible. To quantitatively analyze the calibration effects, a histogram was incorporated into the lower left corner of the image. While visual inspection reveals m… view at source ↗
Figure 5
Figure 5: Example of Dynamic Scene Elements. As shown in the figure, the left and right images were captured at different times with a fixed camera position. Despite the static camera setup, subtle movements of the leaves occur due to external factors such as wind. To address this, we manually created masks for these dynamic regions, allowing researchers to determine whether to include them when computing metric… view at source ↗
Figure 6
Figure 6: Qualitative Comparison of Different Methods on the Test Set Relighting. The numerical values displayed beneath the image correspond to the following metrics: SSIM, PSNR, and LPIPS. In zero-shot scenarios, both DiffusionRenderer [17] and RGB↔X [38] struggle to accurately render image brightness. After finetuning, DiffusionRenderer demonstrates improved alignment with GT in the test set, achieving more accur… view at source ↗
Figure 7
Figure 7: Qualitative Ablation Study Results. We visualize the results of our proposed framework on the WildRelight dataset. The envmap at the bottom left of each GT image corresponds to the target illumination used for relighting. Protocol: for N lightings, one is the test target while the other N − 1 serve as self-supervised signals. This rigorously simulates real-world deployment without ground-truth supervision. Ablation Stu… view at source ↗
Figure 8
Figure 8: Distribution of capture time differences (∆t). The histogram shows the frequency of absolute time delays between the scene image and the environment map capture across the dataset. The distribution is heavily right-skewed, with the vast majority of samples having a delay of less than 20 seconds, confirming that large delays are rare. 1.3 Impact on Relighting Tasks: Modern single-image relighting algorithms … view at source ↗
Figure 9
Figure 9: Advantage of RAW format with HDR. When using an HDR photo, the details of the photo are preserved. Therefore, by adjusting the exposure settings, the original colors of the photo can be accurately restored. The data recorded by a camera sensor in RAW format exhibits a fundamentally linear relationship with the light intensity of the actual scene. This linearity is a core advantage in computational photography… view at source ↗
Figure 10
Figure 10: Side-by-side camera setup; a non-aligned envmap camera will record direct sunlight, but the scene camera records a shadow. 3. Optical Artifacts: 360° cameras rely on heavy distortion correction and stitching algorithms, which introduce resampling artifacts. Using a dedicated rectilinear lens ensures the benchmark data is free from such algorithmic interference. 5.2 Necessity of Strict Spatial Alignment: Preci… view at source ↗
Figure 11
Figure 11: Nodal point alignment. When the camera is not positioned at the nodal point, rotating the camera causes the nearby utility pole to fail to occlude the poles behind it. In contrast, when the camera is at the nodal point, the nearby utility pole can occlude the poles behind it. The empirical determination of the entrance pupil's location, or the no-parallax point, is a foundational procedure in panoramic ph… view at source ↗
Figure 12
Figure 12: More examples from our dataset. 6.3 Details of Dynamic Scene Elements Annotation: A significant challenge in capturing longitudinal, "in-the-wild" datasets is the presence of dynamic scene elements. While our capture rig ensures a static viewpoint, the long temporal intervals between acquisitions mean that elements such as wind-blown foliage, grass, and cloud formations inevitably move. Although computation… view at source ↗
Figure 13
Figure 13: UI Interface for Marking Dynamic Scene Elements. Volunteers utilize a brush tool to create masks or an eraser tool to remove incorrectly marked regions. In the illustrated example, the areas indicated by red arrows correspond to discrepancies between the two photographs, highlighting regions where inconsistencies exist. Our annotation pipeline is as follows: 1. Pairwise Comparison: For each scene, annotat… view at source ↗
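The protocol quoted under Figure 7 (one of N lightings is the test target, the remaining N − 1 are the self-supervised signal) reduces to a leave-one-out split over a scene's captures; a minimal sketch, assuming each capture is an (image, envmap) pair:

```python
def leave_one_out_splits(captures):
    """Yield (support, target) splits per the Figure 7 protocol: for N
    lightings of a scene, each capture in turn is held out as the test
    target while the remaining N - 1 drive self-supervised adaptation.
    `captures` is assumed to be a list of (image, envmap) pairs."""
    for k in range(len(captures)):
        support = captures[:k] + captures[k + 1:]
        yield support, captures[k]
```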
read the original abstract

Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WildRelight, the first in-the-wild dataset for single-image relighting consisting of high-resolution outdoor scenes captured under strictly aligned, temporally varying natural illuminations, each paired with an HDR environment map. It establishes a benchmark showing severe domain shifts in synthetic-trained SOTA models and proposes a physics-guided inference framework that integrates Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA) to enable on-the-fly self-supervised adaptation of synthetic models to real-world statistics, using the temporal structure as a constraint.

Significance. If the adaptation framework holds, the work supplies both a needed real-world benchmark for single-image relighting and a self-supervised adaptation paradigm that could convert the sim-to-real gap into a tractable task without requiring ground-truth relit images, potentially improving generalization of generative relighting methods to outdoor scenes.

major comments (3)
  1. Abstract and §4 (method description): the central claim that DPS+TTA enables synthetic models to 'align with real-world statistics on-the-fly' and transforms the sim-to-real challenge into a 'tractable self-supervised task' is unsupported by any quantitative metrics, baseline comparisons, ablation studies, or implementation details in the provided text, rendering the claim unverifiable.
  2. Abstract and §3 (dataset): the adaptation method rests on the assumption that 'strictly aligned temporal structure' supplies a reliable self-supervised constraint with differences due solely to illumination; no analysis, statistics, or validation is supplied to confirm absence of non-illumination dynamics (foliage motion, cloud shadows, specular changes, or sub-pixel drift) that would corrupt the DPS posterior sampling and TTA objective.
  3. §5 (experiments): without reported numbers on adaptation performance (e.g., PSNR, LPIPS, or perceptual metrics before/after TTA on held-out real frames), it is impossible to assess whether the physics-guided loss actually drives alignment or merely fits to capture artifacts.
minor comments (2)
  1. Abstract: the phrase 'the dataset allows synthetic models to align...' should be rephrased to clarify that the alignment is demonstrated via the proposed method rather than being an intrinsic property of the data alone.
  2. Notation: the distinction between 'Sampling-Aware Test-Time Adaptation (TTA)' and standard TTA is introduced without an equation or pseudocode block; a short algorithmic outline would improve clarity.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We appreciate the acknowledgment of WildRelight's potential as the first in-the-wild benchmark for single-image relighting and the value of the proposed adaptation paradigm. We address each major comment point by point below and will revise the manuscript accordingly to strengthen the presentation and verifiability of our claims.

read point-by-point responses
  1. Referee: Abstract and §4 (method description): the central claim that DPS+TTA enables synthetic models to 'align with real-world statistics on-the-fly' and transforms the sim-to-real challenge into a 'tractable self-supervised task' is unsupported by any quantitative metrics, baseline comparisons, ablation studies, or implementation details in the provided text, rendering the claim unverifiable.

    Authors: We agree that the current version relies on qualitative demonstrations to support the adaptation claims. To address this, the revised manuscript will expand §4 with full implementation details of the DPS+TTA integration, including hyperparameters and sampling procedures. We will also add quantitative results, baseline comparisons, and ablation studies to §5, providing metrics that directly verify the on-the-fly alignment with real-world statistics. revision: yes

  2. Referee: Abstract and §3 (dataset): the adaptation method rests on the assumption that 'strictly aligned temporal structure' supplies a reliable self-supervised constraint with differences due solely to illumination; no analysis, statistics, or validation is supplied to confirm absence of non-illumination dynamics (foliage motion, cloud shadows, specular changes, or sub-pixel drift) that would corrupt the DPS posterior sampling and TTA objective.

    Authors: The capture protocol was designed with fixed-camera, short-interval sequences to isolate illumination as the dominant variable. We acknowledge the need for explicit validation. In the revision, §3 will include a new analysis subsection with alignment statistics (e.g., sub-pixel registration error), optical-flow-based quantification of non-illumination motion, and examples demonstrating that such factors remain negligible relative to lighting changes, thereby supporting the self-supervised constraint. revision: yes

  3. Referee: §5 (experiments): without reported numbers on adaptation performance (e.g., PSNR, LPIPS, or perceptual metrics before/after TTA on held-out real frames), it is impossible to assess whether the physics-guided loss actually drives alignment or merely fits to capture artifacts.

    Authors: We concur that numerical before/after evaluation is required to substantiate the adaptation efficacy. The current experiments emphasize the benchmark and qualitative results. The revised §5 will report PSNR, LPIPS, and additional perceptual metrics on held-out real frames, comparing performance prior to and after TTA, to demonstrate that the physics-guided objective improves alignment beyond artifact fitting. revision: yes
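The optical-flow quantification proposed in response 2 is straightforward to prototype. A minimal sketch using OpenCV's Farnebäck flow; the preprocessing and choice of statistic are assumptions, not the authors' protocol, and moving shadows can themselves register as apparent flow, so the number is best read as an upper bound on geometric motion.

```python
import cv2
import numpy as np

def motion_magnitude(frame_a, frame_b):
    """Median per-pixel displacement between two aligned captures.
    Near-zero values support the claim that frame differences are
    dominated by illumination rather than scene motion."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    displacement = np.linalg.norm(flow, axis=2)  # per-pixel motion in pixels
    return float(np.median(displacement))
```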
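For the before/after numbers promised in response 3, the supplement's own global scale alignment (extract [48] below: α* = argmin_α ‖I_pred · α − I_gt‖²) has a closed form, so the evaluation loop is small. A sketch with scikit-image; the clipping and data range are assumptions, and LPIPS is omitted because it requires a learned network.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def scale_aligned_metrics(pred, gt):
    """Apply the supplement's least-squares scale alignment, then compute
    PSNR and SSIM. Minimizing ||pred * a - gt||^2 over a scalar a gives
    the closed form a* = <pred, gt> / <pred, pred>."""
    alpha = float(np.sum(pred * gt) / np.sum(pred * pred))
    aligned = np.clip(pred * alpha, 0.0, 1.0)   # assumes images in [0, 1]
    return {
        "psnr": peak_signal_noise_ratio(gt, aligned, data_range=1.0),
        "ssim": structural_similarity(gt, aligned, channel_axis=-1,
                                      data_range=1.0),
    }
```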

Circularity Check

0 steps flagged

No significant circularity; new dataset with established DPS+TTA adaptation

full rationale

The paper introduces WildRelight as an independent data collection with strictly aligned temporal frames and HDR maps. It then applies the pre-existing Diffusion Posterior Sampling (DPS) and Sampling-Aware Test-Time Adaptation (TTA) frameworks to leverage that temporal structure as a self-supervised signal. No equations are shown that define a quantity in terms of itself, rename a fitted parameter as a prediction, or reduce the alignment result to a self-citation chain. The temporal stationarity assumption is an external modeling choice about the data, not a definitional loop. The derivation therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the dataset's temporal alignment property and the assumption that natural light evolution supplies a usable self-supervised signal for adaptation.

axioms (1)
  • domain assumption: Strictly aligned temporal captures under varying natural illuminations provide a valid self-supervised constraint for domain adaptation.
    Invoked to justify the physics-guided inference framework.

pith-pipeline@v0.9.0 · 5571 in / 1156 out tokens · 97547 ms · 2026-05-13T06:25:56.255915+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

  1. [1]

    Aksoy, Y., Kim, C., Kellnhofer, P., Paris, S., Elgharib, M., Pollefeys, M., Matusik, W.: A dataset of flash and ambient illumination pairs from the crowd. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 634–649 (2018)

  2. [2]

    Barron, J.T., Malik, J.: Shape, illumination, and reflectance from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(8), 1670–1687 (2014)

  3. [3]

    Boss, M., Braun, R., Jampani, V., Barron, J.T., Liu, C., Lensch, H.: NeRD: Neural reflectance decomposition from image collections. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 12684–12694 (2021)

  4. [4]

    Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C.: Diffusion posterior sampling for general noisy inverse problems. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=OnD9zGAGT0k

  5. [5]

    Cook, R.L., Torrance, K.E.: A reflectance model for computer graphics. ACM Transactions on Graphics (ToG) 1(1), 7–24 (1982)

  6. [6]

    Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. pp. 369–378. SIGGRAPH '97, ACM Press/Addison-Wesley Publishing Co., USA (1997). https://doi.org/10.1145/258734.258884

  7. [7]

    Debevec, P.E., Taylor, C.J., Malik, J.: Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. In: Proceedings of SIGGRAPH '96. pp. 11–20 (1996)

  8. [8]

    Haque, A., Tancik, M., Efros, A.A., Holynski, A., Kanazawa, A.: Instruct-NeRF2NeRF: Editing 3D scenes with instructions. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 19740–19750 (2023)

  9. [9]

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

  10. [10]

    Jakob, W., Speierer, S., Roussel, N., Nimier-David, M., Vicini, D., Zeltner, T., Nicolet, B., Crespo, M., Leroy, V., Zhang, Z.: Mitsuba 3 renderer (2022), https://mitsuba-renderer.org

  11. [11]

    Jensen, R., Dahl, A., Vogiatzis, G., Tola, E., Aanæs, H.: Large scale multi-view stereopsis evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 406–413 (2014)

  12. [12]

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4), 139 (July 2023), https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  13. [13]

    Kuang, Z., Zhang, Y., Yu, H.X., Agarwala, S., Wu, E., Wu, J., et al.: Stanford-ORB: a real-world 3D object inverse rendering benchmark. Advances in Neural Information Processing Systems 36, 46938–46957 (2023)

  14. [14]

    Li, Z., Wang, L., Huang, X., Pan, C., Yang, J.: PhyIR: Physics-based inverse rendering for panoramic indoor images. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 12713–12723 (2022)

  15. [15]

    Li, Z., Shi, J., Bi, S., Zhu, R., Sunkavalli, K., Hašan, M., Xu, Z., Ramamoorthi, R., Chandraker, M.: Physically-based editing of indoor scene lighting from a single image. In: European Conference on Computer Vision (ECCV). pp. 555–572. Springer (2022)

  16. [16]

    Li, Z., Yu, T.W., Sang, S., Wang, S., Song, M., Liu, Y., Yeh, Y.Y., Zhu, R., Gundavarapu, N., Shi, J., Bi, S., Yu, H.X., Xu, Z., Sunkavalli, K., Hasan, M., Ramamoorthi, R., Chandraker, M.: OpenRooms: An open framework for photorealistic indoor scene datasets. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 7190–…

  17. [17]

    Liang, R., Gojcic, Z., Ling, H., Munkberg, J., Hasselgren, J., Lin, C.H., Gao, J., Keller, A., Vijaykumar, N., Fidler, S., et al.: Diffusion renderer: Neural inverse and forward rendering with video diffusion models. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 26069–26080 (2025)

  18. [18]

    Liu, I., Chen, L., Fu, Z., Wu, L., Jin, H., Li, Z., Wong, C.M.R., Xu, Y., Ramamoorthi, R., Xu, Z., et al.: OpenIllumination: A multi-illumination dataset for inverse rendering evaluation on real objects. Advances in Neural Information Processing Systems 36, 36951–36962 (2023)

  19. [19]

    Liu, Y., Wang, P., Lin, C., Long, X., Wang, J., Liu, L., Komura, T., Wang, W.: NeRO: Neural geometry and BRDF reconstruction of reflective objects from multiview images. ACM Transactions on Graphics 42(4), 114 (2023)

  20. [20]

    Lombardi, S., Nishino, K.: Reflectance and illumination recovery in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(1), 129–141 (2015)

  21. [21]

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Proceedings of International Conference on Learning Representations (ICLR) (2019)

  22. [22]

    Luo, J., Ceylan, D., Yoon, J.S., Zhao, N., Philip, J., Frühstück, A., Li, W., Richardt, C., Wang, T.: IntrinsicDiffusion: Joint intrinsic layers from latent diffusion models. In: ACM SIGGRAPH 2024 Conference Papers. pp. 74:1–74:11 (2024)

  23. [23]

    Matusik, W.: A data-driven reflectance model. Ph.D. thesis, Massachusetts Institute of Technology (2003)

  24. [24]

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)

  25. [25]

    Murmann, L., Gharbi, M., Aittala, M., Durand, F.: A dataset of multi-illumination images in the wild. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 4080–4089 (2019)

  26. [26]

    Pei, F., Bai, J., Feng, X., Bi, Z., Zhou, K., Wu, H.: OpenSubstance: A high-quality measured dataset of multi-view and -lighting images and shapes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5221–5231 (2025)

  27. [27]

    Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic tone reproduction for digital images. ACM Transactions on Graphics 21(3), 267–276 (2002)

  28. [28]

    Rudnev, V., Elgharib, M., Smith, W., Liu, L., Golyanik, V., Theobalt, C.: NeRF for outdoor scene relighting. In: European Conference on Computer Vision. pp. 615–631. Springer (2022)

  29. [29]

    Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Proceedings. vol. 1, pp. I–I. IEEE (2003)

  30. [30]

    Sengupta, S., Gu, J., Kim, K., Liu, G., Jacobs, D.W., Kautz, J.: Neural inverse rendering of an indoor scene from a single image. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 8598–8607. IEEE (2019)

  31. [31]

    Shi, B., Wu, Z., Mo, Z., Duan, D., Yeung, S.K., Tan, P.: A benchmark dataset and evaluation for non-Lambertian and uncalibrated photometric stereo. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 3707–3716 (2016)

  32. [32]

    Teufel, T., Gera, P., Zhou, X., Iqbal, U., Rao, P., Kautz, J., Golyanik, V., Theobalt, C.: HumanOLAT: A large-scale dataset for full-body human relighting and novel-view synthesis. In: Proceedings of International Conference on Computer Vision (ICCV). pp. 29131–29141 (2025)

  33. [33]

    Toschi, M., De Matteo, R., Spezialetti, R., De Gregorio, D., Di Stefano, L., Salti, S.: Relight my NeRF: A dataset for novel view synthesis and relighting of real world objects. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 20762–20772 (2023)

  34. [34]

    Ummenhofer, B., Agrawal, S., Sepúlveda, R., Lao, Y., Zhang, K., Cheng, T., Richter, S.R., Wang, S., Ros, G.: Objects with lighting: A real-world dataset for evaluating reconstruction and rendering for object relighting. In: 3DV. IEEE (2024)

  35. [35]

    Walter, B., Marschner, S.R., Li, H., Torrance, K.E.: Microfacet models for refraction through rough surfaces. In: Rendering Techniques (Proceedings of the 18th Eurographics Symposium on Rendering) (2007)

  36. [36]

    Wang, L., Tran, D.M., Cui, R., TG, T., Dahl, A.B., Bigdeli, S.A., Frisvad, J.R., Chandraker, M.: Materialist: Physically based editing using single-image inverse rendering. International Journal of Computer Vision 134(6), 267 (2026). https://doi.org/10.1007/s11263-026-02833-z

  37. [37]

    Yi, R., Zhu, C., Xu, K.: Weakly-supervised single-view image relighting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8402–8411 (2023)

  38. [38]

    Zeng, Z., Deschaintre, V., Georgiev, I., Hold-Geoffroy, Y., Hu, Y., Luan, F., Yan, L.Q., Hašan, M.: RGB↔X: Image decomposition and synthesis using material- and lighting-aware diffusion models. In: ACM SIGGRAPH 2024 Conference Papers. pp. 75:1–75:11. ACM (2024). https://doi.org/10.1145/3641519.3657445

  39. [39]

    Zhang, K., Luan, F., Wang, Q., Bala, K., Snavely, N.: PhySG: Inverse rendering with spherical Gaussians for physics-based material editing and relighting. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 5453–5462 (2021)

  40. [40]

    Zhang, X., Tseng, N., Syed, A., Bhasin, R., Jaipuria, N.: SIMBAR: Single image-based scene relighting for effective data augmentation for automated driving vision tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3718–3728 (2022)

  41. [41]

    Zhang, X., Srinivasan, P.P., Deng, B., Debevec, P., Freeman, W.T., Barron, J.T.: NeRFactor: Neural factorization of shape and reflectance under an unknown illumination. ACM Transactions on Graphics 40(6), 237 (2021)

  42. [42]

    Zhang, Y., Sun, J., He, X., Fu, H., Jia, R., Zhou, X.: Modeling indirect illumination for inverse rendering. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 18643–18652 (2022)

  43. [43]

    Zhao, X., Srinivasan, P., Verbin, D., Park, K., Martin Brualla, R., Henzler, P.: IllumiNeRF: 3D relighting without inverse rendering. Advances in Neural Information Processing Systems 37, 42593–42617 (2024)

  44. [44]

    Zhou, X., Chen, J., Rao, P., Teufel, T., Lyu, L., Minasian, T., Sotnychenko, O., Long, X., Habermann, M., Theobalt, C.: OLATverse: A large-scale real-world object dataset with precise lighting control. arXiv preprint arXiv:2511.02483 (2025)

  45. [45]

    Zhu, R., Li, Z., Matai, J., Porikli, F., Chandraker, M.: IRISformer: Dense vision transformers for single-image inverse rendering in indoor scenes. In: Proceedings of Computer Vision and Pattern Recognition (CVPR). pp. 2822–2831. IEEE (2022)

  46. [46]

    Supplementary Materials, §1 Quantitative Validation of Illumination Alignment: "We provide a rigorous quantitative validation based on metadata timestamp statistics and solar angular displacement analysis. This proves that the temporal gap in our acquisition pipeline results in physically negligible illumination misalignment for the task of relighting."

  47. [47]

    Supplementary Materials, §3 Details Setting of Baseline Benchmark: "…5-scene hold-out test set. We evaluate performance using three standard image quality metrics: Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index (SSIM), and the Learned Perceptual Image Patch Similarity (LPIPS). Baseline Models: We selected three representative methods that support single-image relig…"

  48. [48]

    Supplementary Materials, §5 Advantages of RAW-based HDR Image: "…reported experiments (including baselines, finetuning, and our method), we adopt a global least-squares alignment strategy. For every predicted image I_pred and its corresponding ground truth I_gt, we solve for an optimal scalar α: α* = argmin_α ‖I_pred · α − I_gt‖². (7) Metrics are computed on the aligned prediction I_pred · α*."

  49. [49]

    "Shadow Detail and Color Fidelity: In low-light environments, the RAW format, with its high bit depth (typically 12 or 14 bits), captures extensive detail in the dark regions. By increasing the exposure in post-processing, the original information can be recovered with minimal loss. Conversely, since this information is already discarded d…"

  50. [50]

    "Highlight Information Retention: In highlight regions, while both a RAW-based HDR image and a JPG image may appear as pure white on a Standard Dynamic Range (SDR) display due to exceeding the display's maximum brightness, the amount of information they contain is fundamentally different. The RAW data fully retains the color and tonal information within these bright areas."

  51. [51]

    "Effective Resolution: Even with an 8K 360° capture, projecting the image to a standard 40mm field-of-view (FOV) yields an effective resolution significantly lower than that of the 24MP+ full-frame Sony A7 used in our rig. This loss of high-frequency detail would severely compromise the evaluation of texture preservation and generation."

  52. [52]

    "Image Quality & Dynamic Range: Panoramic cameras typically utilize smaller sensors that introduce noise and chromatic aberration. More critically, they lack the dynamic range of the Sony A7's 14-bit RAW optical path. High dynamic range is essential for outdoor relighting tasks to accurately recover information in deep shadows and bright highlights."

  53. [53]

    Supplementary Materials, Methodology for Determining the Nodal Point (No-Parallax Point), Fig. 10: "Side-by-side camera setup; a non-aligned envmap camera will record direct sunlight, but the scene camera records a shadow."

  54. [54]

    "Optical Artifacts: 360° cameras rely on heavy distortion correction and stitching algorithms, which introduce resampling artifacts. Using a dedicated rectilinear lens ensures the benchmark data is free from such algorithmic interference. 5.2 Necessity of Strict Spatial Alignment: Precise co-location of the environment map camera (Insta360) and the scene ca…"

  55. [55]

    Supplementary Materials, Methodology for Determining the Nodal Point (No-Parallax Point): "…the entrance pupil. Adjustments must then be made to the fore-aft position of the camera on the panoramic head, and the rotational test is repeated. The objective is to achieve a state where, upon panning the camera to the left and right, the two reference objects remain in perfect alignment, with…"

  56. [56]

    "Pairwise Comparison: For each scene, annotators performed a sequential, pairwise comparison of adjacent time steps (e.g., t0 vs. t1, t1 vs. t2, etc.)"

  57. [57]

    "Difference Visualization: To aid the human annotators, we generated absolute pixel-difference images for each pair. This visualization technique effectively accentuates the contours of misaligned objects, where pixel gradients are highest, making the boundaries of dynamic elements more conspicuous."

  58. [58]

    "Manual Annotation: Annotators manually painted masks over all identified dynamic regions for each image pair. The primary targets for masking were clouds and moving vegetation (leaves, branches, and grass)."

  59. [59]

    "Mask Aggregation: The final mask for the entire scene is generated by computing the union of all pairwise masks. This ensures that any element that moved at any point during the capture sequence is included in the aggregate mask. We explicitly excluded two categories of dynamic effects from masking. First, water surfaces (e.g., lakes and seas) were not annotated du…"

  60. [60]

    Supplementary Materials, §7 Differentiable Cook–Torrance Renderer: "To evaluate the physics consistency of predicted G-buffers, we employ a fully differentiable Cook–Torrance microfacet model with split-sum approximation [5]. Let the per-pixel surface properties be defined by base color c_b, normal n, roughness α, and metallicity m, and let the environment illumination be given as an HDR map L_env(ω_i)…"
    DIFFERENTIABLE COOK–TORRANCE RENDERER 11 7 Differentiable Cook–Torrance Renderer To evaluate the physics consistency of predicted G-buffers, we employ a fully differentiable Cook–Torrance microfacet model with split-sum approximation [5]. Let the per-pixel surface properties be defined by basecolorcb, normaln, rough- ness α, and metallicitym, and let the ...