pith. machine review for the scientific record.

arxiv: 2605.07264 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords satellite imagery · DSM reconstruction · depth estimation · RPC model · fine-tuning · feed-forward model · digital surface model · monocular depth

The pith

A fine-tuned monocular depth model using RPC geometry matches optimization accuracy for satellite DSM at over 300 times the speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a general monocular depth foundation model can be adapted to satellite imagery by fine-tuning it on pseudo depths derived from Rational Polynomial Camera geometry. This matters because existing methods force a choice between high accuracy that requires hours of per-scene optimization and fast feed-forward inference that fails due to the domain gap in depth scales and camera models. If the adaptation succeeds, large-scale DSM reconstruction for disaster response and urban planning becomes practical at near-instant speeds. The approach uses the Scale-Invariant Logarithmic loss to produce metric depths suitable for satellite data without any per-scene processing.

Core claim

Sat3R constructs physically consistent pseudo depth supervision from RPC geometry and uses it to fine-tune Depth Anything V2 with the Scale-Invariant Logarithmic loss. This RPC-aware metric depth fine-tuning adapts the model to the satellite domain, enabling feed-forward DSM reconstruction that reduces MAE by 38 percent over zero-shot baselines while achieving competitive accuracy against optimization-based methods at more than 300 times the speed on the DFC2019 benchmark.
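
The SiLog loss named in the claim is, in its usual form, an Eigen-style scale-invariant log error. Below is a minimal NumPy sketch under the common parameterization (λ ≈ 0.85 and an output scale α = 10 are conventional choices, not values confirmed by the paper):

```python
import numpy as np

def silog_loss(pred, target, lam=0.85, alpha=10.0, eps=1e-6):
    """Scale-Invariant Logarithmic (SiLog) loss in its common form.

    pred, target: positive depth arrays of the same shape.
    lam: weight on the squared-mean term; lam=1 fully forgives a global
         scale error, lam=0 reduces to plain log-RMSE.
    """
    d = np.log(pred + eps) - np.log(target + eps)
    return alpha * np.sqrt(np.mean(d ** 2) - lam * np.mean(d) ** 2)

gt = np.array([10.0, 20.0, 30.0])
# A pure global scale error is fully forgiven at lam = 1 ...
print(silog_loss(2.0 * gt, gt, lam=1.0))   # close to 0
# ... but only partially penalized at the usual lam = 0.85.
print(silog_loss(2.0 * gt, gt))
```

This scale-tolerant behavior is presumably why the loss suits adapting a relative-depth foundation model toward metric satellite depth ranges: a residual global scale mismatch does not dominate the gradient.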

What carries the argument

RPC-aware metric depth fine-tuning that adapts a monocular depth foundation model using physically consistent pseudo depth supervision derived from Rational Polynomial Camera geometry.

Load-bearing premise

Pseudo depth maps constructed from RPC geometry supply accurate and unbiased training signals that are sufficient to adapt the foundation model to satellite imagery without introducing systematic errors or needing per-scene optimization.
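
The RPC model behind this premise maps a ground point (lon, lat, height) to an image pixel as a ratio of cubic polynomials. Here is a minimal NumPy sketch of forward RPC projection following the common 20-coefficient (RPC00B-style) term ordering; the dictionary layout and identity-style toy coefficients are illustrative assumptions, and actual pseudo-depth construction would additionally intersect rays from such projections across views:

```python
import numpy as np

def rpc_terms(L, P, H):
    """20-monomial cubic basis in normalized lon (L), lat (P), height (H),
    in the standard RPC00B coefficient order."""
    return np.array([
        1.0, L, P, H, L * P, L * H, P * H, L * L, P * P, H * H,
        L * P * H, L ** 3, L * P * P, L * H * H, L * L * P,
        P ** 3, P * H * H, L * L * H, P * P * H, H ** 3,
    ])

def rpc_project(lon, lat, h, coeffs, offsets, scales):
    """Forward RPC projection: ground point -> image (row, col).

    coeffs holds four 20-vectors ('line_num', 'line_den', 'samp_num',
    'samp_den'); this dict layout is a hypothetical convenience, not a
    standard API.
    """
    # Normalize ground coordinates with the model's offsets and scales.
    L = (lon - offsets['lon']) / scales['lon']
    P = (lat - offsets['lat']) / scales['lat']
    H = (h - offsets['h']) / scales['h']
    t = rpc_terms(L, P, H)
    # Each image coordinate is a ratio of two cubic polynomials.
    row_n = (coeffs['line_num'] @ t) / (coeffs['line_den'] @ t)
    col_n = (coeffs['samp_num'] @ t) / (coeffs['samp_den'] @ t)
    # De-normalize back to pixel units.
    return (row_n * scales['row'] + offsets['row'],
            col_n * scales['col'] + offsets['col'])

# Toy sanity check with identity-style coefficients: row tracks lat, col lon.
coeffs = {'line_num': np.eye(20)[2], 'line_den': np.eye(20)[0],
          'samp_num': np.eye(20)[1], 'samp_den': np.eye(20)[0]}
offsets = {k: 0.0 for k in ('lon', 'lat', 'h', 'row', 'col')}
scales = {k: 1.0 for k in ('lon', 'lat', 'h', 'row', 'col')}
row, col = rpc_project(0.25, -0.5, 0.1, coeffs, offsets, scales)
```

Because fitted RPCs approximate the true sensor geometry, labels derived this way inherit whatever residual the polynomial fit leaves, which is exactly the bias the premise assumes away.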

What would settle it

The claim that RPC-aware fine-tuning bridges the domain gap would be falsified if Sat3R produced no MAE reduction over zero-shot baselines, or fell short of optimization-based accuracy, on the DFC2019 benchmark and on similar satellite test sets with varied RPC parameters.

Figures

Figures reproduced from arXiv:2605.07264 by Chaoyi Zhou, Feng Luo, Hairong Qi, Mert D. Pesé, Minghui Xu, Qiaoyi Yang, Qiushi Chen, Run Wang, Siyu Huang, Xi Liu, Yuhao Xu, Zhi-Qi Cheng.

Figure 1. Runtime vs. accuracy comparison on DFC2019.

Figure 2. Overview of Sat3R. Given multi-view satellite images and their associated RPC models, we first construct pseudo depth supervision …

Figure 3. Qualitative comparison of DSM reconstruction results on selected DFC2019 scenes. Red boxes highlight representative regions …

Figure 4. Qualitative ablation on the maximum depth threshold.
read the original abstract

Accurate Digital Surface Model (DSM) reconstruction from satellite imagery is critical for applications such as disaster response, urban planning, and large-scale geographic mapping. Existing approaches face a fundamental trade-off: optimization-based methods achieve strong accuracy but require hours of per-scene computation, while generalizable geometry foundation models offer near-instant inference but fail to generalize to satellite imagery due to the domain gap introduced by the Rational Polynomial Camera (RPC) model and mismatched depth scale distributions. We present Sat3R, a feed-forward framework that bridges this gap via RPC-aware metric depth fine-tuning of Depth Anything V2 using the Scale-Invariant Logarithmic (SiLog) loss. By constructing physically consistent pseudo depth supervision from RPC geometry, Sat3R adapts a monocular depth foundation model to the satellite domain without per-scene optimization. Experiments on the DFC2019 benchmark demonstrate that Sat3R reduces MAE by 38% over zero-shot feed-forward baselines and achieves competitive accuracy against optimization-based methods, while delivering over 300x speedup. Sat3R demonstrates that feed-forward models, when properly adapted to the satellite domain, can match optimization-based accuracy at a fraction of the computational cost, paving the way for practical large-scale satellite DSM reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sat3R, a feed-forward framework for satellite DSM reconstruction that adapts the Depth Anything V2 monocular depth foundation model via RPC-aware metric depth fine-tuning. It constructs physically consistent pseudo-depth supervision from Rational Polynomial Camera (RPC) geometry and optimizes with the Scale-Invariant Logarithmic (SiLog) loss, avoiding per-scene optimization. On the DFC2019 benchmark, Sat3R reports a 38% MAE reduction over zero-shot feed-forward baselines, competitive accuracy with optimization-based methods, and >300x speedup.

Significance. If the pseudo-depth supervision proves accurate and unbiased, the work would demonstrate that domain-adapted feed-forward models can close the accuracy gap with slow optimization-based DSM pipelines while retaining near-instant inference, enabling practical large-scale satellite mapping for disaster response and urban planning.

major comments (2)
  1. [§3.2] §3.2 (Pseudo-depth supervision construction): The central adaptation claim rests on RPC-derived pseudo depths supplying accurate, unbiased training signals for fine-tuning. The manuscript states these labels are 'physically consistent' but supplies no quantitative validation (e.g., MAE, scale bias, or residual statistics of the pseudo depths versus DFC2019 ground-truth DSM on the fine-tuning scenes). RPC models are known approximations; without this check it is impossible to attribute the reported 38% MAE gain to the fine-tuning procedure rather than to the quality or bias of the supervision.
  2. [§4.2] §4.2 (Experiments and ablations): The results claim a 38% MAE reduction and competitive accuracy, yet the text provides no error bars, multiple-run statistics, or ablation isolating the contribution of RPC-aware supervision versus standard fine-tuning. This makes it difficult to verify that the gains are robust and directly attributable to the proposed RPC-aware component.
minor comments (2)
  1. [Abstract] Abstract and §1: The term 'physically consistent' is used without a precise definition or reference to the RPC residual model; a short clarifying sentence would improve readability.
  2. [§4.1] §4.1: Table captions and axis labels in the quantitative comparison figures could more explicitly list the exact baselines (zero-shot vs. fine-tuned) to avoid reader confusion.
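
The validation asked for in the first major comment is cheap to specify. A hypothetical sketch of the three statistics (MAE, additive bias, least-squares scale factor) one would report for pseudo-depth labels against ground-truth DSM depths; the function name and toy data are illustrative, not from the paper:

```python
import numpy as np

def supervision_quality(pseudo, gt):
    """Error statistics for pseudo-depth labels vs. ground truth:
    MAE, additive (systematic) bias, and best-fit multiplicative scale."""
    resid = pseudo - gt
    mae = np.mean(np.abs(resid))
    bias = np.mean(resid)                         # systematic offset
    scale = np.sum(pseudo * gt) / np.sum(gt ** 2)  # least-squares scale
    return {'mae': mae, 'bias': bias, 'scale': scale}

# Toy check: labels that are uniformly 2% too deep show up as scale ~1.02
gt = np.linspace(500.0, 600.0, 1000)
stats = supervision_quality(1.02 * gt, gt)
```

Reporting these on the fine-tuning scenes would let readers separate the quality of the supervision from the effect of the fine-tuning procedure, which is the attribution problem the referee raises.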

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects for strengthening the presentation of our work. We address each major comment below and will revise the manuscript to incorporate additional validation and analysis.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Pseudo-depth supervision construction): The central adaptation claim rests on RPC-derived pseudo depths supplying accurate, unbiased training signals for fine-tuning. The manuscript states these labels are 'physically consistent' but supplies no quantitative validation (e.g., MAE, scale bias, or residual statistics of the pseudo depths versus DFC2019 ground-truth DSM on the fine-tuning scenes). RPC models are known approximations; without this check it is impossible to attribute the reported 38% MAE gain to the fine-tuning procedure rather than to the quality or bias of the supervision.

    Authors: We agree that explicit quantitative validation of the pseudo-depth labels was not included in the original manuscript. Although the pseudo-depths are constructed directly from RPC geometry and stereo pairs (ensuring consistency with the camera model by design), we acknowledge that reporting error metrics against ground-truth DSMs on the fine-tuning scenes would strengthen the claim. In the revised version, we will add a new paragraph and table in §3.2 with MAE, scale bias, and residual statistics of the pseudo-depths versus DFC2019 ground truth on the training scenes. This will allow readers to evaluate the supervision quality independently. revision: yes

  2. Referee: [§4.2] §4.2 (Experiments and ablations): The results claim a 38% MAE reduction and competitive accuracy, yet the text provides no error bars, multiple-run statistics, or ablation isolating the contribution of RPC-aware supervision versus standard fine-tuning. This makes it difficult to verify that the gains are robust and directly attributable to the proposed RPC-aware component.

    Authors: We concur that the lack of statistical reporting and targeted ablations limits the ability to assess robustness and isolate the RPC-aware component. In the revision, we will add error bars (standard deviation across multiple training runs with different random seeds) to the main results table. We will also include a dedicated ablation subsection in §4.2 comparing (i) zero-shot baseline, (ii) standard fine-tuning without RPC awareness, and (iii) our full RPC-aware fine-tuning. This will directly demonstrate the contribution of the proposed supervision construction. revision: yes
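
The promised error bars are a standard mean ± sample standard deviation over training seeds. A small sketch with made-up MAE values (the numbers are illustrative, not the paper's):

```python
import numpy as np

def report_with_error_bars(mae_per_seed):
    """Mean and sample standard deviation (ddof=1) across seeds."""
    runs = np.asarray(mae_per_seed, dtype=float)
    return runs.mean(), runs.std(ddof=1)

# Hypothetical per-seed MAE values in meters.
mean, std = report_with_error_bars([3.10, 3.18, 3.06, 3.14])
print(f"MAE = {mean:.2f} ± {std:.2f} m")   # prints "MAE = 3.12 ± 0.05 m"
```

With only a handful of seeds the sample standard deviation is itself noisy, so four or more runs per configuration is a reasonable minimum for the revised table.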

Circularity Check

0 steps flagged

No circularity: external RPC geometry supplies independent supervision

full rationale

The paper constructs pseudo-depth labels directly from the RPC camera model (an external geometric prior) and applies standard SiLog fine-tuning to Depth Anything V2. No equation or claim reduces by construction to the model's own outputs, fitted parameters, or prior self-citations; the adaptation is a conventional transfer-learning step whose success is measured against the independent DFC2019 benchmark. The derivation chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that RPC geometry yields reliable pseudo-depth targets; no explicit free parameters or new entities are named in the abstract.

axioms (1)
  • domain assumption The Rational Polynomial Camera model supplies accurate geometric constraints that can be converted into metric pseudo-depth supervision for fine-tuning.
    Invoked when the abstract states that physically consistent pseudo depth supervision is constructed from RPC geometry.

pith-pipeline@v0.9.0 · 5556 in / 1271 out tokens · 38358 ms · 2026-05-11T01:21:33.518355+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

  1. [1]

    Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023.

  2. [2]

    Sat-ngp: Unleashing neural graphics primitives for fast relightable transient-free 3d reconstruction from satellite imagery, 2024

    Camille Billouard, Dawa Derksen, Emmanuelle Sarrazin, and Bruno Vallet. Sat-ngp: Unleashing neural graphics primitives for fast relightable transient-free 3d reconstruction from satellite imagery, 2024.

  3. [3]

    2d gaussian splatting for geometrically accurate radiance fields

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In SIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024.

  4. [4]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. arXiv preprint arXiv:2505.23716.

  5. [5]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  6. [6]

    Grounding image matching in 3d with mast3r, 2024

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with mast3r, 2024.

  7. [7]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.

  8. [8]

    Sat-dn: Implicit surface reconstruction from multi-view satellite images with depth and normal supervision

    Tianle Liu, Shuangming Zhao, Wanshou Jiang, and Bingxuan Guo. Sat-dn: Implicit surface reconstruction from multi-view satellite images with depth and normal supervision. arXiv preprint arXiv:2502.08352, 2025.

  9. [9]

    3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors

    Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  10. [10]

    Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras

    Roger Marí, Gabriele Facciolo, and Thibaud Ehret. Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310–1320.

  11. [11]

    Multi-date earth observation nerf: The detail is in the shadows

    Roger Marí, Gabriele Facciolo, and Thibaud Ehret. Multi-date earth observation nerf: The detail is in the shadows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2034–2044, 2023.

  12. [12]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

  13. [13]

    MultiNeRF: A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF

    Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, Peter Hedman, Ricardo Martin-Brualla, and Jonathan T. Barron. MultiNeRF: A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF, 2022.

  14. [14]

    Instant neural graphics primitives with a multiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, 2022.

  15. [15]

    Global Structure-from-Motion Revisited

    Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conference on Computer Vision (ECCV), 2024.

  16. [16]

    Sat-mesh: Learning neural implicit surfaces for multi-view satellite reconstruction

    Yingjie Qu and Fei Deng. Sat-mesh: Learning neural implicit surfaces for multi-view satellite reconstruction. Remote Sensing, 15:4297, 2023.

  17. [17]

    Data fusion contest 2019 (dfc2019), 2019

    Bertrand Le Saux, Naoto Yokoya, Ronny Hänsch, and Myron Brown. Data fusion contest 2019 (dfc2019), 2019.

  18. [18]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  19. [19]

    A vote-and-verify strategy for fast spatial verification in image retrieval

    Johannes Lutz Schönberger, True Price, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. A vote-and-verify strategy for fast spatial verification in image retrieval. In Asian Conference on Computer Vision (ACCV), 2016.

  20. [20]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.

  21. [21]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

  22. [22]

    Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction

    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.

  23. [23]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.

  24. [24]

    Flexmap: Generalized hd map construction from flexible camera configurations

    Run Wang, Chaoyi Zhou, Amir Salarpour, Xi Liu, Zhi-Qi Cheng, Feng Luo, Mert D. Pesé, and Siyu Huang. Flexmap: Generalized hd map construction from flexible camera configurations, 2026.

  25. [25]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.

  26. [26]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  27. [27]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024.

  28. [28]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv:2406.09414, 2024.

  29. [29]

    Mip-splatting: Alias-free 3d gaussian splatting

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19447–19456, 2024.

  30. [30]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.

  31. [31]

    Latent radiance fields with 3d-aware 2d representations

    Chaoyi Zhou, Xi Liu, Feng Luo, and Siyu Huang. Latent radiance fields with 3d-aware 2d representations. In International Conference on Learning Representations (ICLR).

  32. [32]

    Ff3r: Feedforward feature 3d reconstruction from unconstrained views

    Chaoyi Zhou, Run Wang, Feng Luo, Mert D. Pesé, Zhiwen Fan, Yiqi Zhong, and Siyu Huang. Ff3r: Feedforward feature 3d reconstruction from unconstrained views. In CVPR Findings, 2026.

Table 2 (from the paper's appendix). Ablation on the maximum depth threshold: Max Depth 100 — Mean MAE 3.312, Mean MED 2.254; Max Depth 300 — Mean MAE 3.412, Mean MED 2.354; Max Depth 150 — Mean MAE 3.131, Mean MED 1.963.