pith. machine review for the scientific record.

arxiv: 2605.07264 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords satellite imagery · DSM reconstruction · depth estimation · RPC model · fine-tuning · feed-forward model · digital surface model · monocular depth

The pith

A fine-tuned monocular depth model using RPC geometry matches optimization accuracy for satellite DSM at over 300 times the speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a general monocular depth foundation model can be adapted to satellite imagery by fine-tuning it on pseudo depths derived from Rational Polynomial Camera geometry. This matters because existing methods force a choice between high accuracy that requires hours of per-scene optimization and fast feed-forward inference that fails due to the domain gap in depth scales and camera models. If the adaptation succeeds, large-scale DSM reconstruction for disaster response and urban planning becomes practical at near-instant speeds. The approach uses the Scale-Invariant Logarithmic loss to produce metric depths suitable for satellite data without any per-scene processing.

Core claim

Sat3R constructs physically consistent pseudo depth supervision from RPC geometry and uses it to fine-tune Depth Anything V2 with the Scale-Invariant Logarithmic loss. This RPC-aware metric depth fine-tuning adapts the model to the satellite domain, enabling feed-forward DSM reconstruction that reduces MAE by 38 percent over zero-shot baselines while achieving competitive accuracy against optimization-based methods at more than 300 times the speed on the DFC2019 benchmark.
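
The SiLog loss named in the claim is, in its usual form, an Eigen-style scale-invariant log error. Below is a minimal NumPy sketch under the common parameterization (λ ≈ 0.85 and an output scale α = 10 are conventional choices, not values confirmed by the paper):

```python
import numpy as np

def silog_loss(pred, target, lam=0.85, alpha=10.0, eps=1e-6):
    """Scale-Invariant Logarithmic (SiLog) loss in its common form.

    pred, target: positive depth arrays of the same shape.
    lam: weight on the squared-mean term; lam=1 fully forgives a global
         scale error, lam=0 reduces to plain log-RMSE.
    """
    d = np.log(pred + eps) - np.log(target + eps)
    return alpha * np.sqrt(np.mean(d ** 2) - lam * np.mean(d) ** 2)

gt = np.array([10.0, 20.0, 30.0])
# A pure global scale error is fully forgiven at lam = 1 ...
print(silog_loss(2.0 * gt, gt, lam=1.0))   # close to 0
# ... but only partially penalized at the usual lam = 0.85.
print(silog_loss(2.0 * gt, gt))
```

This scale-tolerant behavior is presumably why the loss suits adapting a relative-depth foundation model toward metric satellite depth ranges: a residual global scale mismatch does not dominate the gradient.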

What carries the argument

RPC-aware metric depth fine-tuning that adapts a monocular depth foundation model using physically consistent pseudo depth supervision derived from Rational Polynomial Camera geometry.

Load-bearing premise

Pseudo depth maps constructed from RPC geometry supply accurate and unbiased training signals that are sufficient to adapt the foundation model to satellite imagery without introducing systematic errors or needing per-scene optimization.
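
The RPC model behind this premise maps a ground point (lon, lat, height) to an image pixel as a ratio of cubic polynomials. Here is a minimal NumPy sketch of forward RPC projection following the common 20-coefficient (RPC00B-style) term ordering; the dictionary layout and identity-style toy coefficients are illustrative assumptions, and actual pseudo-depth construction would additionally intersect rays from such projections across views:

```python
import numpy as np

def rpc_terms(L, P, H):
    """20-monomial cubic basis in normalized lon (L), lat (P), height (H),
    in the standard RPC00B coefficient order."""
    return np.array([
        1.0, L, P, H, L * P, L * H, P * H, L * L, P * P, H * H,
        L * P * H, L ** 3, L * P * P, L * H * H, L * L * P,
        P ** 3, P * H * H, L * L * H, P * P * H, H ** 3,
    ])

def rpc_project(lon, lat, h, coeffs, offsets, scales):
    """Forward RPC projection: ground point -> image (row, col).

    coeffs holds four 20-vectors ('line_num', 'line_den', 'samp_num',
    'samp_den'); this dict layout is a hypothetical convenience, not a
    standard API.
    """
    # Normalize ground coordinates with the model's offsets and scales.
    L = (lon - offsets['lon']) / scales['lon']
    P = (lat - offsets['lat']) / scales['lat']
    H = (h - offsets['h']) / scales['h']
    t = rpc_terms(L, P, H)
    # Each image coordinate is a ratio of two cubic polynomials.
    row_n = (coeffs['line_num'] @ t) / (coeffs['line_den'] @ t)
    col_n = (coeffs['samp_num'] @ t) / (coeffs['samp_den'] @ t)
    # De-normalize back to pixel units.
    return (row_n * scales['row'] + offsets['row'],
            col_n * scales['col'] + offsets['col'])

# Toy sanity check with identity-style coefficients: row tracks lat, col lon.
coeffs = {'line_num': np.eye(20)[2], 'line_den': np.eye(20)[0],
          'samp_num': np.eye(20)[1], 'samp_den': np.eye(20)[0]}
offsets = {k: 0.0 for k in ('lon', 'lat', 'h', 'row', 'col')}
scales = {k: 1.0 for k in ('lon', 'lat', 'h', 'row', 'col')}
row, col = rpc_project(0.25, -0.5, 0.1, coeffs, offsets, scales)
```

Because fitted RPCs approximate the true sensor geometry, labels derived this way inherit whatever residual the polynomial fit leaves, which is exactly the bias the premise assumes away.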

What would settle it

The claim that RPC-aware fine-tuning bridges the domain gap would be falsified if Sat3R produced no MAE reduction over zero-shot baselines, or fell short of optimization-based accuracy, on the DFC2019 benchmark and on similar satellite test sets with varied RPC parameters.

Figures

Figures reproduced from arXiv:2605.07264 by Chaoyi Zhou, Feng Luo, Hairong Qi, Mert D. Pesé, Minghui Xu, Qiaoyi Yang, Qiushi Chen, Run Wang, Siyu Huang, Xi Liu, Yuhao Xu, Zhi-Qi Cheng.

Figure 1. Runtime vs. accuracy comparison on DFC2019.

Figure 2. Overview of Sat3R. Given multi-view satellite images and their associated RPC models, we first construct pseudo depth supervision …

Figure 3. Qualitative comparison of DSM reconstruction results on selected DFC2019 scenes. Red boxes highlight representative regions …

Figure 4. Qualitative ablation on the maximum depth threshold.
read the original abstract

Accurate Digital Surface Model (DSM) reconstruction from satellite imagery is critical for applications such as disaster response, urban planning, and large-scale geographic mapping. Existing approaches face a fundamental trade-off: optimization-based methods achieve strong accuracy but require hours of per-scene computation, while generalizable geometry foundation models offer near-instant inference but fail to generalize to satellite imagery due to the domain gap introduced by the Rational Polynomial Camera (RPC) model and mismatched depth scale distributions. We present Sat3R, a feed-forward framework that bridges this gap via RPC-aware metric depth fine-tuning of Depth Anything V2 using the Scale-Invariant Logarithmic (SiLog) loss. By constructing physically consistent pseudo depth supervision from RPC geometry, Sat3R adapts a monocular depth foundation model to the satellite domain without per-scene optimization. Experiments on the DFC2019 benchmark demonstrate that Sat3R reduces MAE by 38% over zero-shot feed-forward baselines and achieves competitive accuracy against optimization-based methods, while delivering over 300x speedup. Sat3R demonstrates that feed-forward models, when properly adapted to the satellite domain, can match optimization-based accuracy at a fraction of the computational cost, paving the way for practical large-scale satellite DSM reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sat3R, a feed-forward framework for satellite DSM reconstruction that adapts the Depth Anything V2 monocular depth foundation model via RPC-aware metric depth fine-tuning. It constructs physically consistent pseudo-depth supervision from Rational Polynomial Camera (RPC) geometry and optimizes with the Scale-Invariant Logarithmic (SiLog) loss, avoiding per-scene optimization. On the DFC2019 benchmark, Sat3R reports a 38% MAE reduction over zero-shot feed-forward baselines, competitive accuracy with optimization-based methods, and >300x speedup.

Significance. If the pseudo-depth supervision proves accurate and unbiased, the work would demonstrate that domain-adapted feed-forward models can close the accuracy gap with slow optimization-based DSM pipelines while retaining near-instant inference, enabling practical large-scale satellite mapping for disaster response and urban planning.

major comments (2)
  1. [§3.2] §3.2 (Pseudo-depth supervision construction): The central adaptation claim rests on RPC-derived pseudo depths supplying accurate, unbiased training signals for fine-tuning. The manuscript states these labels are 'physically consistent' but supplies no quantitative validation (e.g., MAE, scale bias, or residual statistics of the pseudo depths versus DFC2019 ground-truth DSM on the fine-tuning scenes). RPC models are known approximations; without this check it is impossible to attribute the reported 38% MAE gain to the fine-tuning procedure rather than to the quality or bias of the supervision.
  2. [§4.2] §4.2 (Experiments and ablations): The results claim a 38% MAE reduction and competitive accuracy, yet the text provides no error bars, multiple-run statistics, or ablation isolating the contribution of RPC-aware supervision versus standard fine-tuning. This makes it difficult to verify that the gains are robust and directly attributable to the proposed RPC-aware component.
minor comments (2)
  1. [Abstract] Abstract and §1: The term 'physically consistent' is used without a precise definition or reference to the RPC residual model; a short clarifying sentence would improve readability.
  2. [§4.1] §4.1: Table captions and axis labels in the quantitative comparison figures could more explicitly list the exact baselines (zero-shot vs. fine-tuned) to avoid reader confusion.
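
The validation asked for in the first major comment is cheap to specify. A hypothetical sketch of the three statistics (MAE, additive bias, least-squares scale factor) one would report for pseudo-depth labels against ground-truth DSM depths; the function name and toy data are illustrative, not from the paper:

```python
import numpy as np

def supervision_quality(pseudo, gt):
    """Error statistics for pseudo-depth labels vs. ground truth:
    MAE, additive (systematic) bias, and best-fit multiplicative scale."""
    resid = pseudo - gt
    mae = np.mean(np.abs(resid))
    bias = np.mean(resid)                         # systematic offset
    scale = np.sum(pseudo * gt) / np.sum(gt ** 2)  # least-squares scale
    return {'mae': mae, 'bias': bias, 'scale': scale}

# Toy check: labels that are uniformly 2% too deep show up as scale ~1.02
gt = np.linspace(500.0, 600.0, 1000)
stats = supervision_quality(1.02 * gt, gt)
```

Reporting these on the fine-tuning scenes would let readers separate the quality of the supervision from the effect of the fine-tuning procedure, which is the attribution problem the referee raises.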

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects for strengthening the presentation of our work. We address each major comment below and will revise the manuscript to incorporate additional validation and analysis.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Pseudo-depth supervision construction): The central adaptation claim rests on RPC-derived pseudo depths supplying accurate, unbiased training signals for fine-tuning. The manuscript states these labels are 'physically consistent' but supplies no quantitative validation (e.g., MAE, scale bias, or residual statistics of the pseudo depths versus DFC2019 ground-truth DSM on the fine-tuning scenes). RPC models are known approximations; without this check it is impossible to attribute the reported 38% MAE gain to the fine-tuning procedure rather than to the quality or bias of the supervision.

    Authors: We agree that explicit quantitative validation of the pseudo-depth labels was not included in the original manuscript. Although the pseudo-depths are constructed directly from RPC geometry and stereo pairs (ensuring consistency with the camera model by design), we acknowledge that reporting error metrics against ground-truth DSMs on the fine-tuning scenes would strengthen the claim. In the revised version, we will add a new paragraph and table in §3.2 with MAE, scale bias, and residual statistics of the pseudo-depths versus DFC2019 ground truth on the training scenes. This will allow readers to evaluate the supervision quality independently. revision: yes

  2. Referee: [§4.2] §4.2 (Experiments and ablations): The results claim a 38% MAE reduction and competitive accuracy, yet the text provides no error bars, multiple-run statistics, or ablation isolating the contribution of RPC-aware supervision versus standard fine-tuning. This makes it difficult to verify that the gains are robust and directly attributable to the proposed RPC-aware component.

    Authors: We concur that the lack of statistical reporting and targeted ablations limits the ability to assess robustness and isolate the RPC-aware component. In the revision, we will add error bars (standard deviation across multiple training runs with different random seeds) to the main results table. We will also include a dedicated ablation subsection in §4.2 comparing (i) zero-shot baseline, (ii) standard fine-tuning without RPC awareness, and (iii) our full RPC-aware fine-tuning. This will directly demonstrate the contribution of the proposed supervision construction. revision: yes
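
The promised error bars are a standard mean ± sample standard deviation over training seeds. A small sketch with made-up MAE values (the numbers are illustrative, not the paper's):

```python
import numpy as np

def report_with_error_bars(mae_per_seed):
    """Mean and sample standard deviation (ddof=1) across seeds."""
    runs = np.asarray(mae_per_seed, dtype=float)
    return runs.mean(), runs.std(ddof=1)

# Hypothetical per-seed MAE values in meters.
mean, std = report_with_error_bars([3.10, 3.18, 3.06, 3.14])
print(f"MAE = {mean:.2f} ± {std:.2f} m")   # prints "MAE = 3.12 ± 0.05 m"
```

With only a handful of seeds the sample standard deviation is itself noisy, so four or more runs per configuration is a reasonable minimum for the revised table.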

Circularity Check

0 steps flagged

No circularity: external RPC geometry supplies independent supervision

full rationale

The paper constructs pseudo-depth labels directly from the RPC camera model (an external geometric prior) and applies standard SiLog fine-tuning to Depth Anything V2. No equation or claim reduces by construction to the model's own outputs, fitted parameters, or prior self-citations; the adaptation is a conventional transfer-learning step whose success is measured against the independent DFC2019 benchmark. The derivation chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that RPC geometry yields reliable pseudo-depth targets; no explicit free parameters or new entities are named in the abstract.

axioms (1)
  • domain assumption The Rational Polynomial Camera model supplies accurate geometric constraints that can be converted into metric pseudo-depth supervision for fine-tuning.
    Invoked when the abstract states that physically consistent pseudo depth supervision is constructed from RPC geometry.

pith-pipeline@v0.9.0 · 5556 in / 1271 out tokens · 38358 ms · 2026-05-11T01:21:33.518355+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

  1. [1]

    Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023.

  2. [2]

    Sat-ngp: Unleashing neural graphics primitives for fast relightable transient-free 3d reconstruction from satellite imagery, 2024

    Camille Billouard, Dawa Derksen, Emmanuelle Sarrazin, and Bruno Vallet. Sat-ngp: Unleashing neural graphics primitives for fast relightable transient-free 3d reconstruction from satellite imagery, 2024.

  3. [3]

    2d gaussian splatting for geometrically accurate radiance fields

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In SIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024.

  4. [4]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. arXiv preprint arXiv:2505.23716.

  5. [5]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  6. [6]

    Grounding image matching in 3d with mast3r, 2024

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with mast3r, 2024.

  7. [7]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.

  8. [8]

    Sat-dn: Implicit surface reconstruction from multi-view satellite images with depth and normal supervision

    Tianle Liu, Shuangming Zhao, Wanshou Jiang, and Bingxuan Guo. Sat-dn: Implicit surface reconstruction from multi-view satellite images with depth and normal supervision. arXiv preprint arXiv:2502.08352, 2025.

  9. [9]

    3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors

    Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  10. [10]

    Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras

    Roger Marí, Gabriele Facciolo, and Thibaud Ehret. Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310–1320.

  11. [11]

    Multi-date earth observation nerf: The detail is in the shadows

    Roger Marí, Gabriele Facciolo, and Thibaud Ehret. Multi-date earth observation nerf: The detail is in the shadows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2034–2044, 2023.

  12. [12]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

  13. [13]

    MultiNeRF: A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF

    Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, Peter Hedman, Ricardo Martin-Brualla, and Jonathan T. Barron. MultiNeRF: A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF, 2022.

  14. [14]

    Instant neural graphics primitives with a multiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, 2022.

  15. [15]

    Global Structure-from-Motion Revisited

    Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conference on Computer Vision (ECCV), 2024.

  16. [16]

    Sat-mesh: Learning neural implicit surfaces for multi-view satellite reconstruction

    Yingjie Qu and Fei Deng. Sat-mesh: Learning neural implicit surfaces for multi-view satellite reconstruction. Remote Sensing, 15:4297, 2023.

  17. [17]

    Data fusion contest 2019 (dfc2019), 2019

    Bertrand Le Saux, Naoto Yokoya, Ronny Hänsch, and Myron Brown. Data fusion contest 2019 (dfc2019), 2019.

  18. [18]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  19. [19]

    A vote-and-verify strategy for fast spatial verification in image retrieval

    Johannes Lutz Schönberger, True Price, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. A vote-and-verify strategy for fast spatial verification in image retrieval. In Asian Conference on Computer Vision (ACCV), 2016.

  20. [20]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.

  21. [21]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

  22. [22]

    Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction

    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.

  23. [23]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.

  24. [24]

    Flexmap: Generalized hd map construction from flexible camera configurations

    Run Wang, Chaoyi Zhou, Amir Salarpour, Xi Liu, Zhi-Qi Cheng, Feng Luo, Mert D. Pesé, and Siyu Huang. Flexmap: Generalized hd map construction from flexible camera configurations, 2026.

  25. [25]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.

  26. [26]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  27. [27]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024.

  28. [28]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv:2406.09414, 2024.

  29. [29]

    Mip-splatting: Alias-free 3d gaussian splatting

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19447–19456, 2024.

  30. [30]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.

  31. [31]

    Latent radiance fields with 3d-aware 2d representations

    Chaoyi Zhou, Xi Liu, Feng Luo, and Siyu Huang. Latent radiance fields with 3d-aware 2d representations. In International Conference on Learning Representations (ICLR).

  32. [32]

    Ff3r: Feedforward feature 3d reconstruction from unconstrained views

    Chaoyi Zhou, Run Wang, Feng Luo, Mert D. Pesé, Zhiwen Fan, Yiqi Zhong, and Siyu Huang. Ff3r: Feedforward feature 3d reconstruction from unconstrained views. In CVPR Findings, 2026.

Table 2 (from the paper's appendix). Ablation on the maximum depth threshold: Max Depth 100 — Mean MAE 3.312, Mean MED 2.254; Max Depth 300 — Mean MAE 3.412, Mean MED 2.354; Max Depth 150 — Mean MAE 3.131, Mean MED 1.963.