pith · machine review for the scientific record

arxiv: 2604.03462 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.GR · cs.RO

Recognition: 2 Lean theorem links

SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:37 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.RO
keywords Gaussian Splatting · Appearance Disentanglement · Driving Scenes · Relighting · Feed-forward Reconstruction · DINOv2 · Temporal Consistency

The pith

SpectralSplat factors color prediction into base and adapted streams to disentangle geometry from appearance in feed-forward Gaussian splatting for driving scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that splitting color output into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both generated by one MLP that takes a global embedding from DINOv2 features, lets the model separate scene geometry from lighting, weather, and time-of-day effects. Training relies on paired observations created by a hybrid pipeline that applies physics-based decomposition first and then refines the result with diffusion, plus four complementary losses that push consistency across appearances. A sympathetic reader would care because this separation would allow the same reconstructed driving scene to be rendered under new lighting or weather conditions without retraining or visible artifacts, while an added temporal buffer of appearance-agnostic features keeps accumulated Gaussians stable across frames.

Core claim

SpectralSplat factors color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features. The model is trained on paired observations generated by a hybrid relighting pipeline that combines physics-based intrinsic decomposition with diffusion-based generative refinement, supervised by complementary consistency, reconstruction, cross-appearance, and base color losses. An appearance-adaptable temporal history stores appearance-agnostic features so that accumulated Gaussians can be re-rendered under arbitrary target appearances. The approach is claimed to preserve the reconstruction quality of the underlying backbone while enabling controllable appearance transfer and temporally consistent relighting.
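The temporal history is the piece that extends the disentanglement from single frames to sequences. A minimal sketch of the idea, assuming a list-backed buffer and a caller-supplied rasterizer; the buffer layout and update rule here are editorial guesses, not the authors' code:

```python
# Sketch of an appearance-adaptable temporal history: geometry accumulates
# together with appearance-agnostic features, so the whole history can be
# re-colored under any target appearance embedding. All interfaces are
# hypothetical placeholders.
class TemporalHistory:
    def __init__(self, rasterize):
        # rasterize(gaussians, colors, cameras) is a hypothetical splatting
        # renderer supplied by the caller.
        self.rasterize = rasterize
        self.gaussians = []   # accumulated Gaussian parameters (geometry)
        self.features = []    # per-Gaussian appearance-agnostic features

    def update(self, new_gaussians, new_features):
        # No frame-specific lighting is baked into the buffer: only
        # appearance-free features are stored alongside the geometry.
        self.gaussians.extend(new_gaussians)
        self.features.extend(new_features)

    def render(self, color_head, appearance_embedding, cameras):
        # Re-color the entire accumulated scene under one target embedding,
        # which is what keeps renders temporally consistent across frames.
        colors = [color_head(f, appearance_embedding) for f in self.features]
        return self.rasterize(self.gaussians, colors, cameras)
```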

What carries the argument

Dual-stream color prediction (appearance-agnostic base plus appearance-conditioned adapted) from a shared MLP conditioned on a global DINOv2 appearance embedding, which factors appearance out of the Gaussian color output to isolate it from geometry.
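As a concrete reading of this dual-stream design, here is a minimal PyTorch sketch. The shared trunk, the two heads, and the 2048-dimensional DINOv2 patch features come from the material above (see the Figure 2 caption); the layer sizes, mean-pooling of patch features, and sigmoid outputs are editorial assumptions:

```python
# Minimal sketch of a dual-stream color head (not the authors' implementation).
import torch
import torch.nn as nn

class DualStreamColorHead(nn.Module):
    def __init__(self, feat_dim=256, embed_dim=256, dino_dim=2048):
        super().__init__()
        # Global appearance embedding projected from DINOv2 patch features.
        self.embed_proj = nn.Linear(dino_dim, embed_dim)
        # Shared trunk over per-Gaussian features.
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        # Appearance-agnostic base color.
        self.base_head = nn.Linear(256, 3)
        # Appearance-conditioned adapted color (trunk features + embedding).
        self.adapt_head = nn.Linear(256 + embed_dim, 3)

    def forward(self, gaussian_feats, dino_patches):
        # gaussian_feats: (N, feat_dim); dino_patches: (P, dino_dim)
        a = self.embed_proj(dino_patches.mean(dim=0))  # global appearance embedding
        h = self.trunk(gaussian_feats)
        c_base = torch.sigmoid(self.base_head(h))      # appearance-agnostic stream
        a_exp = a.unsqueeze(0).expand(h.shape[0], -1)
        c_adapt = torch.sigmoid(self.adapt_head(torch.cat([h, a_exp], dim=-1)))
        return c_base, c_adapt
```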

Load-bearing premise

The hybrid relighting pipeline produces paired observations whose appearance changes are realistic and free of artifacts so the disentanglement losses can separate geometry from appearance without side effects.
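Schematically, the premise is that the physics step contributes 3D-consistent geometry to the relit pair while the diffusion step contributes photorealism (cf. Figure 3). The sketch below is an editorial reading with hypothetical `intrinsic_model` and `relight_diffusion` interfaces, not the authors' pipeline:

```python
# Sketch of hybrid pair generation: physics-based intrinsic decomposition
# followed by diffusion refinement. All names are hypothetical placeholders.
def make_training_pair(image, intrinsic_model, relight_diffusion, target_light):
    # 1. Physics step: decompose into albedo/shading and re-shade under the
    #    target illumination. 3D-consistent but visually flat (Figure 3).
    albedo, shading = intrinsic_model.decompose(image)
    flat_relit = albedo * target_light.render_shading(shading)
    # 2. Generative step: diffusion refinement restores photorealism while
    #    staying anchored to the physically re-shaded input.
    augmented = relight_diffusion.refine(flat_relit, condition=target_light)
    # The (image, augmented) pair shares geometry and differs only in
    # appearance -- the property every disentanglement loss relies on.
    return image, augmented
```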

What would settle it

Apply the trained model to two real captures of the same scene taken under different lighting, transfer the appearance embedding from one to the other, and check whether the rendered output shows geometry-consistent colors with no visible artifacts or drop in PSNR compared to the original backbone.
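In code, the proposed test is a single swap-and-compare loop. Every interface here (`reconstruct`, `appearance_embedding`, `render`, `psnr`) is a placeholder; the logic is just the protocol described above:

```python
# Sketch of the settling experiment: geometry from capture A, appearance
# embedding from capture B of the same scene under different lighting.
def appearance_swap_test(model, backbone, cap_a, cap_b, psnr):
    gaussians = model.reconstruct(cap_a.images)
    a_b = model.appearance_embedding(cap_b.images)
    swapped = model.render(gaussians, appearance=a_b, cameras=cap_b.cameras)
    # Baseline: the original backbone reconstructing capture B directly.
    baseline = backbone.render(backbone.reconstruct(cap_b.images),
                               cameras=cap_b.cameras)
    # The claim survives if the swapped render matches capture B at least as
    # well as the appearance-entangled backbone does (artifact-freeness still
    # needs visual inspection).
    return psnr(swapped, cap_b.images) >= psnr(baseline, cap_b.images)
```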

Figures

Figures reproduced from arXiv: 2604.03462 by Chensheng Peng, Depu Meng, Jiezhi Yang, Quentin Herau, Spencer Sherk, Tianshuo Xu, Wei Zhan, Yihan Hu.

Figure 1. Appearance-disentangled Gaussian reconstruction.

Figure 2. Training pipeline. Original and augmented images share geometry but produce separate features and appearance embeddings. Four losses enforce disentanglement: L_inv (base invariance), L_aug (augmented reconstruction), L_swap (cross-appearance), and L_base (base color alignment).

Figure 3. Relighting pipeline. MVInverse + physics rendering is 3D-consistent but flat; IC-Light alone is photorealistic but inconsistent; our hybrid pipeline achieves both.

Figure 4. Cross-appearance results on Waymo. Rows 1–3: source ground truth, adapted render, and base color. Rows 4–6: same for the augmented condition. Base colors from both are nearly identical, confirming appearance-invariance. Rows 7–8: swapping the appearance embedding transfers appearance while preserving geometry.

Figure 5. t-SNE of appearance embeddings. Embeddings cluster by illumination type, confirming the encoder captures meaningful appearance information.

Figure 6. WildGaussians vs. SpectralSplat (panels: Ground Truth (source), WildGaussians, SpectralSplat (ours)). Both methods are trained on augmented images and evaluated on the original source images. WildGaussians exhibits artifacts in the reconstruction, while SpectralSplat produces cleaner geometry and appearance.

Figure 7. Appearance transfer grid. Each row renders the same scene geometry under different appearance embeddings. Column 1: original render. Column 2: base color (a = 0). Columns 3–7: renders using the appearance embedding from the reference image shown in the top row. The appearance embedding transfers the global color tone and lighting mood while preserving scene geometry.

Figure 8. Cross-appearance results on Waymo 2. Rows 1–3: source ground truth, adapted render, and base color. Rows 4–6: same for the augmented condition. Base colors from both are nearly identical, confirming appearance-invariance. Rows 7–8: swapping the appearance embedding transfers appearance while preserving geometry.

Figure 9. Cross-appearance results on Waymo 3. Rows 1–3: source ground truth, adapted render, and base color. Rows 4–6: same for the augmented condition. Base colors from both are nearly identical, confirming appearance-invariance. Rows 7–8: swapping the appearance embedding transfers appearance while preserving geometry.
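The four losses named in Figure 2 can be written out in a few lines. The sketch below assumes plain L2 photometric terms and equal weights, neither of which the material above specifies; `render` is a placeholder that returns either the base or the adapted image for a given input and embedding:

```python
# Sketch of the Figure 2 objective. src/aug are the paired captures sharing
# geometry; a_aug is the augmented view's appearance embedding. Placeholders.
import torch.nn.functional as F

def disentanglement_losses(render, src, aug, a_aug):
    # L_inv (base invariance): base renders of the paired views must match,
    # since they share geometry and the base stream ignores appearance.
    l_inv = F.mse_loss(render(src, base=True), render(aug, base=True))
    # L_aug (augmented reconstruction): the adapted stream must reproduce
    # the augmented image given its own appearance embedding.
    l_aug = F.mse_loss(render(aug, appearance=a_aug), aug.images)
    # L_swap (cross-appearance): rendering the source view with the augmented
    # embedding must land on the augmented image.
    l_swap = F.mse_loss(render(src, appearance=a_aug), aug.images)
    # L_base (base color alignment): anchors the base stream to a plausible
    # appearance-neutral image (the exact target is an assumption here).
    l_base = F.mse_loss(render(src, base=True), src.images)
    return l_inv + l_aug + l_swap + l_base
```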
read the original abstract

Feed-forward 3D Gaussian Splatting methods have achieved impressive reconstruction quality for autonomous driving scenes, yet they entangle scene geometry with transient appearance properties such as lighting, weather, and time of day. This coupling prevents relighting, appearance transfer, and consistent rendering across multi-traversal data captured under varying environmental conditions. We present SpectralSplat, a method that disentangles appearance from geometry within a feed-forward Gaussian Splatting framework. Our key insight is to factor color prediction into an appearance-agnostic base stream and and appearance-conditioned adapted stream, both produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features. To enforce disentanglement, we train with paired observations generated by a hybrid relighting pipeline that combines physics-based intrinsic decomposition with diffusion based generative refinement, and supervise with complementary consistency, reconstruction, cross-appearance, and base color losses. We further introduce an appearance-adaptable temporal history that stores appearance-agnostic features, enabling accumulated Gaussians to be re-rendered under arbitrary target appearances. Experiments demonstrate that SpectralSplat preserves the reconstruction quality of the underlying backbone while enabling controllable appearance transfer and temporally consistent relighting across driving sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents SpectralSplat, a feed-forward 3D Gaussian Splatting method for driving scenes that disentangles geometry from transient appearance (lighting, weather) by factoring color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both from a shared MLP conditioned on a global embedding derived from DINOv2 features. Training uses paired observations generated by a hybrid physics-based intrinsic decomposition plus diffusion refinement pipeline, supervised by consistency, reconstruction, cross-appearance, and base color losses; an appearance-adaptable temporal history stores agnostic features to support re-rendering under target appearances. The central claim is that reconstruction quality of the backbone is preserved while enabling controllable appearance transfer and temporally consistent relighting across multi-traversal sequences.

Significance. If the disentanglement holds, the work would meaningfully extend feed-forward Gaussian Splatting to multi-condition driving data by adding controllable relighting and transfer without quality loss or per-scene optimization, with direct utility for simulation and data augmentation. The pragmatic combination of DINOv2 conditioning and hybrid generative pairs is a reasonable engineering choice, and the temporal history mechanism is a clear incremental contribution for sequence consistency.

major comments (3)
  1. [§3.3] §3.3 (Hybrid Relighting Pipeline): The pipeline that generates the paired observations is load-bearing for all disentanglement losses (consistency, cross-appearance, base color), yet the manuscript supplies no equations for the physics-diffusion integration, no implementation parameters, and no quantitative fidelity metrics (PSNR, perceptual scores, or artifact counts) against real multi-traversal captures; without this, it is impossible to confirm that the pairs are free of lighting inconsistencies or diffusion artifacts that would prevent reliable separation of the base and adapted streams.
  2. [§4] §4 (Experiments): The claim that reconstruction quality is preserved is stated but unsupported by any reported tables, PSNR/SSIM/LPIPS numbers, or direct comparisons to the underlying backbone method; likewise, no ablation isolates the contribution of the appearance-adaptable temporal history or the cross-appearance loss, leaving the central quality-preservation and consistency claims unverified.
  3. [§3.2] §3.2 (Losses and Embedding): The global appearance embedding dimension is listed as a free parameter with no sensitivity study or justification; because the shared MLP and all four losses depend on this embedding to separate streams, the absence of robustness checks risks that observed disentanglement is an artifact of the specific pipeline pairs rather than a general property.
minor comments (3)
  1. [Abstract] Abstract: repeated word ('base stream and and appearance-conditioned') should be corrected to 'base stream and an appearance-conditioned'.
  2. [§2] §2 (Related Work): additional citations to recent feed-forward GS driving papers (post-2023) would better situate the novelty of the disentanglement approach.
  3. [Figure 3] Figure 3 (Temporal History): the diagram would benefit from explicit arrows and labels showing how appearance-agnostic features flow into the re-rendering step under a new target embedding.
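Major comment 2 asks for standard image-quality numbers against the backbone. A minimal sketch of such a comparison, assuming the scikit-image metrics and the `lpips` package (the perceptual metric of Zhang et al. [49]); the model renders themselves are placeholders:

```python
# Sketch of a per-view quality comparison: PSNR, SSIM, LPIPS.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # deep perceptual metric

def score(render: np.ndarray, gt: np.ndarray) -> dict:
    # render, gt: (H, W, 3) float arrays in [0, 1].
    # LPIPS expects (1, 3, H, W) tensors scaled to [-1, 1].
    t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    return {
        "psnr": peak_signal_noise_ratio(gt, render, data_range=1.0),
        "ssim": structural_similarity(gt, render, channel_axis=2, data_range=1.0),
        "lpips": float(lpips_fn(t(render), t(gt))),
    }
```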

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate the requested details and experiments.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (Hybrid Relighting Pipeline): The pipeline that generates the paired observations is load-bearing for all disentanglement losses (consistency, cross-appearance, base color), yet the manuscript supplies no equations for the physics-diffusion integration, no implementation parameters, and no quantitative fidelity metrics (PSNR, perceptual scores, or artifact counts) against real multi-traversal captures; without this, it is impossible to confirm that the pairs are free of lighting inconsistencies or diffusion artifacts that would prevent reliable separation of the base and adapted streams.

    Authors: We agree that the hybrid relighting pipeline is central and requires fuller documentation. In the revised manuscript we will add the explicit equations for the physics-based intrinsic decomposition step and its integration with the diffusion refinement, list all implementation parameters (diffusion steps, guidance scales, etc.), and report quantitative fidelity metrics (PSNR, LPIPS, and artifact counts) evaluated against held-out real multi-traversal captures to demonstrate that the generated pairs are sufficiently clean for reliable disentanglement. revision: yes

  2. Referee: [§4] §4 (Experiments): The claim that reconstruction quality is preserved is stated but unsupported by any reported tables, PSNR/SSIM/LPIPS numbers, or direct comparisons to the underlying backbone method; likewise, no ablation isolates the contribution of the appearance-adaptable temporal history or the cross-appearance loss, leaving the central quality-preservation and consistency claims unverified.

    Authors: We acknowledge the absence of explicit quantitative tables and ablations in the current draft. The revised version will include a results table reporting PSNR, SSIM, and LPIPS for SpectralSplat versus the backbone on the test set, plus dedicated ablations that remove the temporal history and the cross-appearance loss individually, with the corresponding metrics to verify their contributions to quality preservation and consistency. revision: yes

  3. Referee: [§3.2] §3.2 (Losses and Embedding): The global appearance embedding dimension is listed as a free parameter with no sensitivity study or justification; because the shared MLP and all four losses depend on this embedding to separate streams, the absence of robustness checks risks that observed disentanglement is an artifact of the specific pipeline pairs rather than a general property.

    Authors: We will add a sensitivity study in the revision that varies the embedding dimension across a range of values (e.g., 128, 256, 512) and reports the resulting disentanglement metrics (cross-appearance consistency and base-color fidelity). This will both justify the chosen dimension and demonstrate that the separation is robust rather than pipeline-specific. revision: yes
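As an editorial illustration, the promised sensitivity study amounts to a short sweep; `train` and `evaluate` are hypothetical stand-ins for the authors' training and metric code:

```python
# Sketch of the embedding-dimension sensitivity study from the rebuttal:
# retrain with several dimensions and log the two named disentanglement metrics.
def embedding_dim_sweep(train, evaluate, dims=(128, 256, 512)):
    results = {}
    for d in dims:
        model = train(embed_dim=d)
        results[d] = {
            "cross_appearance_consistency": evaluate(model, "swap"),
            "base_color_fidelity": evaluate(model, "base"),
        }
    return results
```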

Circularity Check

0 steps flagged

No significant circularity: the architecture and losses depend on external DINOv2 features and an independent relighting pipeline

full rationale

The paper factors color prediction into an appearance-agnostic base stream and appearance-conditioned adapted stream via a shared MLP conditioned on a global embedding from DINOv2 features. Disentanglement is supervised by consistency, reconstruction, cross-appearance, and base color losses applied to paired observations produced by an external hybrid relighting pipeline (physics-based intrinsic decomposition plus diffusion refinement). An appearance-adaptable temporal history stores appearance-agnostic features for re-rendering under target appearances. No quoted equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs; DINOv2 and the relighting pipeline are independent external components, and the losses are standard supervision terms rather than self-definitional. The preservation of reconstruction quality and controllable transfer therefore follow from the architectural separation without circular equivalence to the method's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the separability of geometry and appearance via the proposed streams and on the fidelity of the external relighting pipeline; no new physical entities are introduced.

free parameters (1)
  • global appearance embedding dimension
    Dimensionality and integration of the DINOv2-derived embedding are chosen to condition the MLP and may be tuned to the dataset.
axioms (1)
  • domain assumption Color can be factored into an appearance-agnostic base stream and an appearance-conditioned adapted stream without loss of geometric fidelity.
    Invoked in the key insight section of the abstract to justify the dual-stream design.
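One compact way to state this axiom, assuming for concreteness an additive split and using the a = 0 base-color convention from the Figure 7 caption (the paper may combine the two streams differently):

```latex
% Editorial formalization; c(x, a) is the color of Gaussian x under
% appearance embedding a. The additive form is an assumption.
c(x, a) = \underbrace{c_{\mathrm{base}}(x)}_{\text{appearance-agnostic}}
        + \underbrace{\Delta c(x, a)}_{\text{appearance-conditioned}},
\qquad \Delta c(x, 0) = 0 \;\Longrightarrow\; c(x, 0) = c_{\mathrm{base}}(x).
```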

pith-pipeline@v0.9.0 · 5538 in / 1378 out tokens · 46734 ms · 2026-05-13T19:37:41.335099+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean — washburn_uniqueness_aczel (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    factor color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features... supervise with complementary consistency, reconstruction, cross-appearance, and base color losses

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean — LogicNat zero as identity (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Base Invariance... L_inv = ||Î_base_src − Î_base_aug||₂²

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

  1. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In: CVPR. pp. 5470–5479 (2022)
  2. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Zip-NeRF: Anti-aliased grid-based neural radiance fields. In: ICCV. pp. 19697–19705 (2023)
  3. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: CVPR. pp. 11621–11631 (2020)
  4. Charatan, D., Li, S.L., Tagliasacchi, A., Sitzmann, V.: pixelSplat: 3D Gaussian splats from image pairs for scalable generalizable 3D reconstruction. In: CVPR. pp. 19457–19467 (2024)
  5. Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: MVSplat: Efficient 3D Gaussian splatting from sparse multi-view images. In: ECCV. pp. 370–386. Springer (2024)
  6. Chen, Z., Yang, J., Yang, H.: PreF3R: Pose-free feed-forward 3D Gaussian splatting from variable-length image sequence (2024), https://arxiv.org/abs/2411.16877
  7. Chen, Z., Xu, T., Ge, W., Wu, L., Yan, D., He, J., Wang, L., Zeng, L., Zhang, S., Chen, Y.C.: Uni-Renderer: Unifying rendering and inverse rendering via dual stream diffusion. In: CVPR. pp. 26504–26513 (2025)
  8. Dahmani, H., Bennehar, M., Piasco, N., Roldao, L., Tsishkou, D.: SWAG: Splatting in the wild images with appearance-conditioned Gaussians. In: ECCV. pp. 325–340. Springer (2024)
  9. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. TPAMI 32(8), 1362–1376 (2009)
  10. Herau, Q., Bennehar, M., Moreau, A., Piasco, N., Roldão, L., Tsishkou, D., Migniot, C., Vasseur, P., Demonceaux, C.: 3DGS-Calib: 3D Gaussian splatting for multimodal spatiotemporal calibration. In: IROS. pp. 8315–8321 (2024)
  11. Herau, Q., Piasco, N., Bennehar, M., Roldão, L., Tsishkou, D., Liu, B., Migniot, C., Vasseur, P., Demonceaux, C.: Pose optimization for autonomous driving datasets using neural rendering models. arXiv preprint arXiv:2504.15776 (2025)
  12. Herau, Q., Piasco, N., Bennehar, M., Roldao, L., Tsishkou, D., Migniot, C., Vasseur, P., Demonceaux, C.: MOISST: Multimodal optimization of implicit scene for spatiotemporal calibration. In: IROS. pp. 1810–1817 (2023)
  13. Herau, Q., Piasco, N., Bennehar, M., Roldao, L., Tsishkou, D., Migniot, C., Vasseur, P., Demonceaux, C.: SOAC: Spatio-temporal overlap-aware multi-sensor calibration using neural radiance fields. In: CVPR. pp. 15131–15140 (2024)
  14. Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S.: 2D Gaussian splatting for geometrically accurate radiance fields. In: ACM SIGGRAPH. pp. 1–11 (2024)
  15. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR. pp. 1125–1134 (2017)
  16. Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al.: AnySplat: Feed-forward 3D Gaussian splatting from unconstrained views. TOG 44(6), 1–16 (2025)
  17. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. TOG 42(4), 139:1–139:14 (2023)
  18. Kulhanek, J., Peng, S., Kukelova, Z., Pollefeys, M., Sattler, T.: WildGaussians: 3D Gaussian splatting in the wild. In: NeurIPS (2024)
  19. Liu, K., Zhan, F., Chen, Y., Zhang, J., Yu, Y., El Saddik, A., Lu, S., Xing, E.P.: StyleRF: Zero-shot 3D style transfer of neural radiance fields. In: CVPR. pp. 8338–8348 (2023)
  20. Lu, H., Xu, T., Zheng, W., Zhang, Y., Zhan, W., Du, D., Tomizuka, M., Keutzer, K., Chen, Y.: DrivingRecon: Large 4D Gaussian reconstruction model for autonomous driving. arXiv preprint arXiv:2412.09043 (2024)
  21. Martin-Brualla, R., Radwan, N., Sajjadi, M.S.M., Barron, J.T., Dosovitskiy, A., Duckworth, D.: NeRF in the wild: Neural radiance fields for unconstrained photo collections. In: CVPR. pp. 7210–7219 (2021)
  22. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. In: ECCV. pp. 405–421 (2020)
  23. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research (2024)
  24. Pandey, R., Orts-Escolano, S., LeGendre, C., Häne, C., Bouaziz, S., Rhemann, C., Debevec, P., Fanello, S.: Total relighting: Learning to relight portraits for background replacement. TOG 40(4), 43:1–43:21 (2021)
  25. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)
  26. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR. pp. 4104–4113 (2016)
  27. Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV. pp. 501–518 (2016)
  28. Shi, C., Shi, S., Lyu, X., Liu, C., Sheng, K., Zhang, B., Jiang, L.: UniSplat: Unified spatio-temporal fusion via 3D latent scaffolds for dynamic driving scene reconstruction. arXiv preprint arXiv:2511.04595 (2025)
  29. Smart, B., Zheng, C., Laina, I., Prisacariu, V.A.: Splatt3R: Zero-shot Gaussian splatting from uncalibrated image pairs (2024), https://arxiv.org/abs/2408.13912
  30. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3D. TOG 25(3), 835–846 (2006)
  31. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
  32. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: CVPR. pp. 2446–2454 (2020)
  33. Sun, T., Barron, J.T., Tsai, Y.T., Xu, Z., Yu, X., Fyffe, G., Rhemann, C., Busch, J., Debevec, P., Ramamoorthi, R.: Single image portrait relighting. TOG 38(4), 79:1–79:12 (2019)
  34. Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter Image: Ultra-fast single-view 3D reconstruction. In: CVPR. pp. 10208–10217 (2024)
  35. Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H.: Block-NeRF: Scalable large scene neural view synthesis. In: CVPR. pp. 8248–8258 (2022)
  36. Wang, H., Agapito, L.: 3D reconstruction with spatial memory. In: 3DV. pp. 78–89. IEEE (2025)
  37. Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. In: CVPR (2025)
  38. Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., Wang, W.: NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In: NeurIPS. pp. 27171–27183 (2021)
  39. Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: π³: Permutation-equivariant visual geometry learning (2025), https://arxiv.org/abs/2507.13347
  40. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
  41. Wei, D., Li, Z., Liu, P.: Omni-Scene: Omni-Gaussian representation for ego-centric sparse-view scene reconstruction. In: CVPR (2025)
  42. Wu, X., Ren, C., Zhou, J., Li, X., Liu, Y.: MVInverse: Feed-forward multi-view inverse rendering in seconds. arXiv preprint arXiv:2512.21003 (2025)
  43. Xu, H., Peng, S., Wang, F., Blum, H., Barath, D., Geiger, A., Pollefeys, M.: DepthSplat: Connecting Gaussian splatting and depth. In: CVPR (2025)
  44. Yang, J., Desai, K., Packer, C., Bhatia, H., Rhinehart, N., McAllister, R., Gonzalez, J.E.: CARFF: Conditional auto-encoded radiance field for 3D scene forecasting. In: ECCV. pp. 225–242. Springer (2024)
  45. Yang, Z., Chen, Y., Wang, J., Manivasagam, S., Ma, W.C., Yang, A.J., Urtasun, R.: UniSim: A neural closed-loop sensor simulator. In: CVPR (2023)
  46. Ye, B., Liu, S., Xu, H., Li, X., Pollefeys, M., Yang, M.H., Peng, S.: No pose, no problem: Surprisingly simple 3D Gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207 (2024)
  47. Zhang, K., Kolkin, N., Bi, S., Luan, F., Xu, Z., Shechtman, E., Snavely, N.: ARF: Artistic radiance fields. In: ECCV (2022)
  48. Zhang, L., Rao, A., Agrawala, M.: Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In: ICLR (2025)
  49. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR. pp. 586–595 (2018)
  50. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV. pp. 2223–2232 (2017)