pith · machine review for the scientific record

arxiv: 2604.03462 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.GR · cs.RO

Recognition: 2 Lean theorem links

SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:37 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.RO
keywords Gaussian Splatting · Appearance Disentanglement · Driving Scenes · Relighting · Feed-forward Reconstruction · DINOv2 · Temporal Consistency

The pith

SpectralSplat factors color prediction into base and adapted streams to disentangle geometry from appearance in feed-forward Gaussian splatting for driving scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that splitting color output into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both generated by one MLP that takes a global embedding from DINOv2 features, lets the model separate scene geometry from lighting, weather, and time-of-day effects. Training relies on paired observations created by a hybrid pipeline that applies physics-based decomposition first and then refines the result with diffusion, plus four complementary losses that push consistency across appearances. A sympathetic reader would care because this separation would allow the same reconstructed driving scene to be rendered under new lighting or weather conditions without retraining or visible artifacts, while an added temporal buffer of appearance-agnostic features keeps accumulated Gaussians stable across frames.

Core claim

SpectralSplat factors color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features. The model is trained on paired observations generated by a hybrid relighting pipeline that combines physics-based intrinsic decomposition with diffusion-based generative refinement, supervised by complementary consistency, reconstruction, cross-appearance, and base color losses. An appearance-adaptable temporal history stores appearance-agnostic features so that accumulated Gaussians can be re-rendered under arbitrary target appearances. The approach is claimed to preserve the reconstruction quality of the underlying backbone while enabling controllable appearance transfer and temporally consistent relighting.
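The temporal history is the piece that extends the disentanglement from single frames to sequences. A minimal sketch of the idea, assuming a list-backed buffer and a caller-supplied rasterizer; the buffer layout and update rule here are editorial guesses, not the authors' code:

```python
# Sketch of an appearance-adaptable temporal history: geometry accumulates
# together with appearance-agnostic features, so the whole history can be
# re-colored under any target appearance embedding. All interfaces are
# hypothetical placeholders.
class TemporalHistory:
    def __init__(self, rasterize):
        # rasterize(gaussians, colors, cameras) is a hypothetical splatting
        # renderer supplied by the caller.
        self.rasterize = rasterize
        self.gaussians = []   # accumulated Gaussian parameters (geometry)
        self.features = []    # per-Gaussian appearance-agnostic features

    def update(self, new_gaussians, new_features):
        # No frame-specific lighting is baked into the buffer: only
        # appearance-free features are stored alongside the geometry.
        self.gaussians.extend(new_gaussians)
        self.features.extend(new_features)

    def render(self, color_head, appearance_embedding, cameras):
        # Re-color the entire accumulated scene under one target embedding,
        # which is what keeps renders temporally consistent across frames.
        colors = [color_head(f, appearance_embedding) for f in self.features]
        return self.rasterize(self.gaussians, colors, cameras)
```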

What carries the argument

Dual-stream color prediction (appearance-agnostic base plus appearance-conditioned adapted) from a shared MLP conditioned on a global DINOv2 appearance embedding, which factors appearance out of the Gaussian color output to isolate it from geometry.
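As a concrete reading of this dual-stream design, here is a minimal PyTorch sketch. The shared trunk, the two heads, and the 2048-dimensional DINOv2 patch features come from the material above (see the Figure 2 caption); the layer sizes, mean-pooling of patch features, and sigmoid outputs are editorial assumptions:

```python
# Minimal sketch of a dual-stream color head (not the authors' implementation).
import torch
import torch.nn as nn

class DualStreamColorHead(nn.Module):
    def __init__(self, feat_dim=256, embed_dim=256, dino_dim=2048):
        super().__init__()
        # Global appearance embedding projected from DINOv2 patch features.
        self.embed_proj = nn.Linear(dino_dim, embed_dim)
        # Shared trunk over per-Gaussian features.
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        # Appearance-agnostic base color.
        self.base_head = nn.Linear(256, 3)
        # Appearance-conditioned adapted color (trunk features + embedding).
        self.adapt_head = nn.Linear(256 + embed_dim, 3)

    def forward(self, gaussian_feats, dino_patches):
        # gaussian_feats: (N, feat_dim); dino_patches: (P, dino_dim)
        a = self.embed_proj(dino_patches.mean(dim=0))  # global appearance embedding
        h = self.trunk(gaussian_feats)
        c_base = torch.sigmoid(self.base_head(h))      # appearance-agnostic stream
        a_exp = a.unsqueeze(0).expand(h.shape[0], -1)
        c_adapt = torch.sigmoid(self.adapt_head(torch.cat([h, a_exp], dim=-1)))
        return c_base, c_adapt
```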

Load-bearing premise

The hybrid relighting pipeline produces paired observations whose appearance changes are realistic and free of artifacts so the disentanglement losses can separate geometry from appearance without side effects.
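Schematically, the premise is that the physics step contributes 3D-consistent geometry to the relit pair while the diffusion step contributes photorealism (cf. Figure 3). The sketch below is an editorial reading with hypothetical `intrinsic_model` and `relight_diffusion` interfaces, not the authors' pipeline:

```python
# Sketch of hybrid pair generation: physics-based intrinsic decomposition
# followed by diffusion refinement. All names are hypothetical placeholders.
def make_training_pair(image, intrinsic_model, relight_diffusion, target_light):
    # 1. Physics step: decompose into albedo/shading and re-shade under the
    #    target illumination. 3D-consistent but visually flat (Figure 3).
    albedo, shading = intrinsic_model.decompose(image)
    flat_relit = albedo * target_light.render_shading(shading)
    # 2. Generative step: diffusion refinement restores photorealism while
    #    staying anchored to the physically re-shaded input.
    augmented = relight_diffusion.refine(flat_relit, condition=target_light)
    # The (image, augmented) pair shares geometry and differs only in
    # appearance -- the property every disentanglement loss relies on.
    return image, augmented
```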

What would settle it

Apply the trained model to two real captures of the same scene taken under different lighting, transfer the appearance embedding from one to the other, and check whether the rendered output shows geometry-consistent colors with no visible artifacts or drop in PSNR compared to the original backbone.
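In code, the proposed test is a single swap-and-compare loop. Every interface here (`reconstruct`, `appearance_embedding`, `render`, `psnr`) is a placeholder; the logic is just the protocol described above:

```python
# Sketch of the settling experiment: geometry from capture A, appearance
# embedding from capture B of the same scene under different lighting.
def appearance_swap_test(model, backbone, cap_a, cap_b, psnr):
    gaussians = model.reconstruct(cap_a.images)
    a_b = model.appearance_embedding(cap_b.images)
    swapped = model.render(gaussians, appearance=a_b, cameras=cap_b.cameras)
    # Baseline: the original backbone reconstructing capture B directly.
    baseline = backbone.render(backbone.reconstruct(cap_b.images),
                               cameras=cap_b.cameras)
    # The claim survives if the swapped render matches capture B at least as
    # well as the appearance-entangled backbone does (artifact-freeness still
    # needs visual inspection).
    return psnr(swapped, cap_b.images) >= psnr(baseline, cap_b.images)
```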

Figures

Figures reproduced from arXiv: 2604.03462 by Chensheng Peng, Depu Meng, Jiezhi Yang, Quentin Herau, Spencer Sherk, Tianshuo Xu, Wei Zhan, Yihan Hu.

Figure 1. Appearance-disentangled Gaussian reconstruction.

Figure 2. Training pipeline. Original and augmented images share geometry but produce separate features and appearance embeddings. Four losses enforce disentanglement: L_inv (base invariance), L_aug (augmented reconstruction), L_swap (cross-appearance), and L_base (base color alignment).

Figure 3. Relighting pipeline. MVInverse + physics rendering is 3D-consistent but flat; IC-Light alone is photorealistic but inconsistent; our hybrid pipeline achieves both.

Figure 4. Cross-appearance results on Waymo. Rows 1–3: source ground truth, adapted render, and base color. Rows 4–6: same for the augmented condition. Base colors from both are nearly identical, confirming appearance-invariance. Rows 7–8: swapping the appearance embedding transfers appearance while preserving geometry.

Figure 5. t-SNE of appearance embeddings. Embeddings cluster by illumination type, confirming the encoder captures meaningful appearance information.

Figure 6. WildGaussians vs. SpectralSplat (panels: Ground Truth (source), WildGaussians, SpectralSplat (ours)). Both methods are trained on augmented images and evaluated on the original source images. WildGaussians exhibits artifacts in the reconstruction, while SpectralSplat produces cleaner geometry and appearance.

Figure 7. Appearance transfer grid. Each row renders the same scene geometry under different appearance embeddings. Column 1: original render. Column 2: base color (a = 0). Columns 3–7: renders using the appearance embedding from the reference image shown in the top row. The appearance embedding transfers the global color tone and lighting mood while preserving scene geometry.

Figure 8. Cross-appearance results on Waymo 2. Rows 1–3: source ground truth, adapted render, and base color. Rows 4–6: same for the augmented condition. Base colors from both are nearly identical, confirming appearance-invariance. Rows 7–8: swapping the appearance embedding transfers appearance while preserving geometry.

Figure 9. Cross-appearance results on Waymo 3. Rows 1–3: source ground truth, adapted render, and base color. Rows 4–6: same for the augmented condition. Base colors from both are nearly identical, confirming appearance-invariance. Rows 7–8: swapping the appearance embedding transfers appearance while preserving geometry.
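The four losses named in Figure 2 can be written out in a few lines. The sketch below assumes plain L2 photometric terms and equal weights, neither of which the material above specifies; `render` is a placeholder that returns either the base or the adapted image for a given input and embedding:

```python
# Sketch of the Figure 2 objective. src/aug are the paired captures sharing
# geometry; a_aug is the augmented view's appearance embedding. Placeholders.
import torch.nn.functional as F

def disentanglement_losses(render, src, aug, a_aug):
    # L_inv (base invariance): base renders of the paired views must match,
    # since they share geometry and the base stream ignores appearance.
    l_inv = F.mse_loss(render(src, base=True), render(aug, base=True))
    # L_aug (augmented reconstruction): the adapted stream must reproduce
    # the augmented image given its own appearance embedding.
    l_aug = F.mse_loss(render(aug, appearance=a_aug), aug.images)
    # L_swap (cross-appearance): rendering the source view with the augmented
    # embedding must land on the augmented image.
    l_swap = F.mse_loss(render(src, appearance=a_aug), aug.images)
    # L_base (base color alignment): anchors the base stream to a plausible
    # appearance-neutral image (the exact target is an assumption here).
    l_base = F.mse_loss(render(src, base=True), src.images)
    return l_inv + l_aug + l_swap + l_base
```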
read the original abstract

Feed-forward 3D Gaussian Splatting methods have achieved impressive reconstruction quality for autonomous driving scenes, yet they entangle scene geometry with transient appearance properties such as lighting, weather, and time of day. This coupling prevents relighting, appearance transfer, and consistent rendering across multi-traversal data captured under varying environmental conditions. We present SpectralSplat, a method that disentangles appearance from geometry within a feed-forward Gaussian Splatting framework. Our key insight is to factor color prediction into an appearance-agnostic base stream and and appearance-conditioned adapted stream, both produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features. To enforce disentanglement, we train with paired observations generated by a hybrid relighting pipeline that combines physics-based intrinsic decomposition with diffusion based generative refinement, and supervise with complementary consistency, reconstruction, cross-appearance, and base color losses. We further introduce an appearance-adaptable temporal history that stores appearance-agnostic features, enabling accumulated Gaussians to be re-rendered under arbitrary target appearances. Experiments demonstrate that SpectralSplat preserves the reconstruction quality of the underlying backbone while enabling controllable appearance transfer and temporally consistent relighting across driving sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents SpectralSplat, a feed-forward 3D Gaussian Splatting method for driving scenes that disentangles geometry from transient appearance (lighting, weather) by factoring color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both from a shared MLP conditioned on a global embedding derived from DINOv2 features. Training uses paired observations generated by a hybrid physics-based intrinsic decomposition plus diffusion refinement pipeline, supervised by consistency, reconstruction, cross-appearance, and base color losses; an appearance-adaptable temporal history stores agnostic features to support re-rendering under target appearances. The central claim is that reconstruction quality of the backbone is preserved while enabling controllable appearance transfer and temporally consistent relighting across multi-traversal sequences.

Significance. If the disentanglement holds, the work would meaningfully extend feed-forward Gaussian Splatting to multi-condition driving data by adding controllable relighting and transfer without quality loss or per-scene optimization, with direct utility for simulation and data augmentation. The pragmatic combination of DINOv2 conditioning and hybrid generative pairs is a reasonable engineering choice, and the temporal history mechanism is a clear incremental contribution for sequence consistency.

major comments (3)
  1. [§3.3] §3.3 (Hybrid Relighting Pipeline): The pipeline that generates the paired observations is load-bearing for all disentanglement losses (consistency, cross-appearance, base color), yet the manuscript supplies no equations for the physics-diffusion integration, no implementation parameters, and no quantitative fidelity metrics (PSNR, perceptual scores, or artifact counts) against real multi-traversal captures; without this, it is impossible to confirm that the pairs are free of lighting inconsistencies or diffusion artifacts that would prevent reliable separation of the base and adapted streams.
  2. [§4] §4 (Experiments): The claim that reconstruction quality is preserved is stated but unsupported by any reported tables, PSNR/SSIM/LPIPS numbers, or direct comparisons to the underlying backbone method; likewise, no ablation isolates the contribution of the appearance-adaptable temporal history or the cross-appearance loss, leaving the central quality-preservation and consistency claims unverified.
  3. [§3.2] §3.2 (Losses and Embedding): The global appearance embedding dimension is listed as a free parameter with no sensitivity study or justification; because the shared MLP and all four losses depend on this embedding to separate streams, the absence of robustness checks risks that observed disentanglement is an artifact of the specific pipeline pairs rather than a general property.
minor comments (3)
  1. [Abstract] Abstract: repeated word ('base stream and and appearance-conditioned') should be corrected to 'base stream and an appearance-conditioned'.
  2. [§2] §2 (Related Work): additional citations to recent feed-forward GS driving papers (post-2023) would better situate the novelty of the disentanglement approach.
  3. [Figure 3] Figure 3 (Temporal History): the diagram would benefit from explicit arrows and labels showing how appearance-agnostic features flow into the re-rendering step under a new target embedding.
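Major comment 2 asks for standard image-quality numbers against the backbone. A minimal sketch of such a comparison, assuming the scikit-image metrics and the `lpips` package (the perceptual metric of Zhang et al. [49]); the model renders themselves are placeholders:

```python
# Sketch of a per-view quality comparison: PSNR, SSIM, LPIPS.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # deep perceptual metric

def score(render: np.ndarray, gt: np.ndarray) -> dict:
    # render, gt: (H, W, 3) float arrays in [0, 1].
    # LPIPS expects (1, 3, H, W) tensors scaled to [-1, 1].
    t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    return {
        "psnr": peak_signal_noise_ratio(gt, render, data_range=1.0),
        "ssim": structural_similarity(gt, render, channel_axis=2, data_range=1.0),
        "lpips": float(lpips_fn(t(render), t(gt))),
    }
```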

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate the requested details and experiments.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (Hybrid Relighting Pipeline): The pipeline that generates the paired observations is load-bearing for all disentanglement losses (consistency, cross-appearance, base color), yet the manuscript supplies no equations for the physics-diffusion integration, no implementation parameters, and no quantitative fidelity metrics (PSNR, perceptual scores, or artifact counts) against real multi-traversal captures; without this, it is impossible to confirm that the pairs are free of lighting inconsistencies or diffusion artifacts that would prevent reliable separation of the base and adapted streams.

    Authors: We agree that the hybrid relighting pipeline is central and requires fuller documentation. In the revised manuscript we will add the explicit equations for the physics-based intrinsic decomposition step and its integration with the diffusion refinement, list all implementation parameters (diffusion steps, guidance scales, etc.), and report quantitative fidelity metrics (PSNR, LPIPS, and artifact counts) evaluated against held-out real multi-traversal captures to demonstrate that the generated pairs are sufficiently clean for reliable disentanglement. revision: yes

  2. Referee: [§4] §4 (Experiments): The claim that reconstruction quality is preserved is stated but unsupported by any reported tables, PSNR/SSIM/LPIPS numbers, or direct comparisons to the underlying backbone method; likewise, no ablation isolates the contribution of the appearance-adaptable temporal history or the cross-appearance loss, leaving the central quality-preservation and consistency claims unverified.

    Authors: We acknowledge the absence of explicit quantitative tables and ablations in the current draft. The revised version will include a results table reporting PSNR, SSIM, and LPIPS for SpectralSplat versus the backbone on the test set, plus dedicated ablations that remove the temporal history and the cross-appearance loss individually, with the corresponding metrics to verify their contributions to quality preservation and consistency. revision: yes

  3. Referee: [§3.2] §3.2 (Losses and Embedding): The global appearance embedding dimension is listed as a free parameter with no sensitivity study or justification; because the shared MLP and all four losses depend on this embedding to separate streams, the absence of robustness checks risks that observed disentanglement is an artifact of the specific pipeline pairs rather than a general property.

    Authors: We will add a sensitivity study in the revision that varies the embedding dimension across a range of values (e.g., 128, 256, 512) and reports the resulting disentanglement metrics (cross-appearance consistency and base-color fidelity). This will both justify the chosen dimension and demonstrate that the separation is robust rather than pipeline-specific. revision: yes
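As an editorial illustration, the promised sensitivity study amounts to a short sweep; `train` and `evaluate` are hypothetical stand-ins for the authors' training and metric code:

```python
# Sketch of the embedding-dimension sensitivity study from the rebuttal:
# retrain with several dimensions and log the two named disentanglement metrics.
def embedding_dim_sweep(train, evaluate, dims=(128, 256, 512)):
    results = {}
    for d in dims:
        model = train(embed_dim=d)
        results[d] = {
            "cross_appearance_consistency": evaluate(model, "swap"),
            "base_color_fidelity": evaluate(model, "base"),
        }
    return results
```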

Circularity Check

0 steps flagged

No significant circularity: the architecture and losses depend on external DINOv2 features and an independent relighting pipeline

full rationale

The paper factors color prediction into an appearance-agnostic base stream and appearance-conditioned adapted stream via a shared MLP conditioned on a global embedding from DINOv2 features. Disentanglement is supervised by consistency, reconstruction, cross-appearance, and base color losses applied to paired observations produced by an external hybrid relighting pipeline (physics-based intrinsic decomposition plus diffusion refinement). An appearance-adaptable temporal history stores appearance-agnostic features for re-rendering under target appearances. No quoted equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs; DINOv2 and the relighting pipeline are independent external components, and the losses are standard supervision terms rather than self-definitional. The preservation of reconstruction quality and controllable transfer therefore follow from the architectural separation without circular equivalence to the method's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the separability of geometry and appearance via the proposed streams and on the fidelity of the external relighting pipeline; no new physical entities are introduced.

free parameters (1)
  • global appearance embedding dimension
    Dimensionality and integration of the DINOv2-derived embedding are chosen to condition the MLP and may be tuned to the dataset.
axioms (1)
  • domain assumption Color can be factored into an appearance-agnostic base stream and an appearance-conditioned adapted stream without loss of geometric fidelity.
    Invoked in the key insight section of the abstract to justify the dual-stream design.
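One compact way to state this axiom, assuming for concreteness an additive split and using the a = 0 base-color convention from the Figure 7 caption (the paper may combine the two streams differently):

```latex
% Editorial formalization; c(x, a) is the color of Gaussian x under
% appearance embedding a. The additive form is an assumption.
c(x, a) = \underbrace{c_{\mathrm{base}}(x)}_{\text{appearance-agnostic}}
        + \underbrace{\Delta c(x, a)}_{\text{appearance-conditioned}},
\qquad \Delta c(x, 0) = 0 \;\Longrightarrow\; c(x, 0) = c_{\mathrm{base}}(x).
```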

pith-pipeline@v0.9.0 · 5538 in / 1378 out tokens · 46734 ms · 2026-05-13T19:37:41.335099+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean — washburn_uniqueness_aczel (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    factor color prediction into an appearance-agnostic base stream and an appearance-conditioned adapted stream, both produced by a shared MLP conditioned on a global appearance embedding derived from DINOv2 features... supervise with complementary consistency, reconstruction, cross-appearance, and base color losses

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean — LogicNat zero as identity (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Base Invariance... L_inv = ||Î_base_src − Î_base_aug||₂²

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

  1. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In: CVPR. pp. 5470–5479 (2022)
  2. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Zip-NeRF: Anti-aliased grid-based neural radiance fields. In: ICCV. pp. 19697–19705 (2023)
  3. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: CVPR. pp. 11621–11631 (2020)
  4. Charatan, D., Li, S.L., Tagliasacchi, A., Sitzmann, V.: pixelSplat: 3D Gaussian splats from image pairs for scalable generalizable 3D reconstruction. In: CVPR. pp. 19457–19467 (2024)
  5. Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: MVSplat: Efficient 3D Gaussian splatting from sparse multi-view images. In: ECCV. pp. 370–386. Springer (2024)
  6. Chen, Z., Yang, J., Yang, H.: PreF3R: Pose-free feed-forward 3D Gaussian splatting from variable-length image sequence (2024), https://arxiv.org/abs/2411.16877
  7. Chen, Z., Xu, T., Ge, W., Wu, L., Yan, D., He, J., Wang, L., Zeng, L., Zhang, S., Chen, Y.C.: Uni-Renderer: Unifying rendering and inverse rendering via dual stream diffusion. In: CVPR. pp. 26504–26513 (2025)
  8. Dahmani, H., Bennehar, M., Piasco, N., Roldao, L., Tsishkou, D.: SWAG: Splatting in the wild images with appearance-conditioned Gaussians. In: ECCV. pp. 325–340. Springer (2024)
  9. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. TPAMI 32(8), 1362–1376 (2009)
  10. Herau, Q., Bennehar, M., Moreau, A., Piasco, N., Roldão, L., Tsishkou, D., Migniot, C., Vasseur, P., Demonceaux, C.: 3DGS-Calib: 3D Gaussian splatting for multimodal spatiotemporal calibration. In: IROS. pp. 8315–8321 (2024)
  11. Herau, Q., Piasco, N., Bennehar, M., Roldão, L., Tsishkou, D., Liu, B., Migniot, C., Vasseur, P., Demonceaux, C.: Pose optimization for autonomous driving datasets using neural rendering models. arXiv preprint arXiv:2504.15776 (2025)
  12. Herau, Q., Piasco, N., Bennehar, M., Roldao, L., Tsishkou, D., Migniot, C., Vasseur, P., Demonceaux, C.: MOISST: Multimodal optimization of implicit scene for spatiotemporal calibration. In: IROS. pp. 1810–1817 (2023)
  13. Herau, Q., Piasco, N., Bennehar, M., Roldao, L., Tsishkou, D., Migniot, C., Vasseur, P., Demonceaux, C.: SOAC: Spatio-temporal overlap-aware multi-sensor calibration using neural radiance fields. In: CVPR. pp. 15131–15140 (2024)
  14. Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S.: 2D Gaussian splatting for geometrically accurate radiance fields. In: ACM SIGGRAPH. pp. 1–11 (2024)
  15. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR. pp. 1125–1134 (2017)
  16. Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al.: AnySplat: Feed-forward 3D Gaussian splatting from unconstrained views. TOG 44(6), 1–16 (2025)
  17. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. TOG 42(4), 139:1–139:14 (2023)
  18. Kulhanek, J., Peng, S., Kukelova, Z., Pollefeys, M., Sattler, T.: WildGaussians: 3D Gaussian splatting in the wild. In: NeurIPS (2024)
  19. Liu, K., Zhan, F., Chen, Y., Zhang, J., Yu, Y., El Saddik, A., Lu, S., Xing, E.P.: StyleRF: Zero-shot 3D style transfer of neural radiance fields. In: CVPR. pp. 8338–8348 (2023)
  20. Lu, H., Xu, T., Zheng, W., Zhang, Y., Zhan, W., Du, D., Tomizuka, M., Keutzer, K., Chen, Y.: DrivingRecon: Large 4D Gaussian reconstruction model for autonomous driving. arXiv preprint arXiv:2412.09043 (2024)
  21. Martin-Brualla, R., Radwan, N., Sajjadi, M.S.M., Barron, J.T., Dosovitskiy, A., Duckworth, D.: NeRF in the wild: Neural radiance fields for unconstrained photo collections. In: CVPR. pp. 7210–7219 (2021)
  22. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. In: ECCV. pp. 405–421 (2020)
  23. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research (2024)
  24. Pandey, R., Orts-Escolano, S., LeGendre, C., Häne, C., Bouaziz, S., Rhemann, C., Debevec, P., Fanello, S.: Total relighting: Learning to relight portraits for background replacement. TOG 40(4), 43:1–43:21 (2021)
  25. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)
  26. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR. pp. 4104–4113 (2016)
  27. Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV. pp. 501–518 (2016)
  28. Shi, C., Shi, S., Lyu, X., Liu, C., Sheng, K., Zhang, B., Jiang, L.: UniSplat: Unified spatio-temporal fusion via 3D latent scaffolds for dynamic driving scene reconstruction. arXiv preprint arXiv:2511.04595 (2025)
  29. Smart, B., Zheng, C., Laina, I., Prisacariu, V.A.: Splatt3R: Zero-shot Gaussian splatting from uncalibrated image pairs (2024), https://arxiv.org/abs/2408.13912
  30. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3D. TOG 25(3), 835–846 (2006)
  31. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
  32. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: CVPR. pp. 2446–2454 (2020)
  33. Sun, T., Barron, J.T., Tsai, Y.T., Xu, Z., Yu, X., Fyffe, G., Rhemann, C., Busch, J., Debevec, P., Ramamoorthi, R.: Single image portrait relighting. TOG 38(4), 79:1–79:12 (2019)
  34. Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter Image: Ultra-fast single-view 3D reconstruction. In: CVPR. pp. 10208–10217 (2024)
  35. Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H.: Block-NeRF: Scalable large scene neural view synthesis. In: CVPR. pp. 8248–8258 (2022)
  36. Wang, H., Agapito, L.: 3D reconstruction with spatial memory. In: 3DV. pp. 78–89. IEEE (2025)
  37. Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. In: CVPR (2025)
  38. Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., Wang, W.: NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In: NeurIPS. pp. 27171–27183 (2021)
  39. Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: π³: Permutation-equivariant visual geometry learning (2025), https://arxiv.org/abs/2507.13347
  40. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
  41. Wei, D., Li, Z., Liu, P.: Omni-Scene: Omni-Gaussian representation for ego-centric sparse-view scene reconstruction. In: CVPR (2025)
  42. Wu, X., Ren, C., Zhou, J., Li, X., Liu, Y.: MVInverse: Feed-forward multi-view inverse rendering in seconds. arXiv preprint arXiv:2512.21003 (2025)
  43. Xu, H., Peng, S., Wang, F., Blum, H., Barath, D., Geiger, A., Pollefeys, M.: DepthSplat: Connecting Gaussian splatting and depth. In: CVPR (2025)
  44. Yang, J., Desai, K., Packer, C., Bhatia, H., Rhinehart, N., McAllister, R., Gonzalez, J.E.: CARFF: Conditional auto-encoded radiance field for 3D scene forecasting. In: ECCV. pp. 225–242. Springer (2024)
  45. Yang, Z., Chen, Y., Wang, J., Manivasagam, S., Ma, W.C., Yang, A.J., Urtasun, R.: UniSim: A neural closed-loop sensor simulator. In: CVPR (2023)
  46. Ye, B., Liu, S., Xu, H., Li, X., Pollefeys, M., Yang, M.H., Peng, S.: No pose, no problem: Surprisingly simple 3D Gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207 (2024)
  47. Zhang, K., Kolkin, N., Bi, S., Luan, F., Xu, Z., Shechtman, E., Snavely, N.: ARF: Artistic radiance fields. In: ECCV (2022)
  48. Zhang, L., Rao, A., Agrawala, M.: Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In: ICLR (2025)
  49. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR. pp. 586–595 (2018)
  50. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV. pp. 2223–2232 (2017)