pith. sign in

arxiv: 2605.31577 · v1 · pith:GJBL2L67new · submitted 2026-05-29 · 💻 cs.CV

SurGe: Improved Surface Geometry in Point Maps

Pith reviewed 2026-06-28 22:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords point mapssurface geometrymonocular 3D reconstructionlocal geometry evaluationpoint gradient matchingneighborhood attention decoder3D surface normalszero-shot geometry benchmarks
0
0 comments X

The pith

SurGe improves local surface geometry in point map predictions by adding a gradient matching loss and neighborhood attention decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current feedforward 3D reconstruction methods produce point maps that capture overall scene structure but still show clear local surface inaccuracies. The paper introduces a point map normal metric that checks orientation consistency between neighboring 3D points to make these errors measurable. It then adds a point gradient matching loss that trains the model on depth-normalized 3D differences between points and a Neighborhood Attention Decoder that upsamples while mixing nearby features with attention. The resulting SurGe model records the best average rank on global point map accuracy and also raises scores on the new local metrics across eight zero-shot benchmarks.

Core claim

A point gradient matching loss that supervises depth-normalized 3D finite differences, paired with a Neighborhood Attention Decoder that progressively upsamples and mixes local features via neighborhood attention, yields point maps with more accurate local surface geometry as measured by a new point map normal metric, while also securing the best average rank on global point map AbsRel across eight zero-shot monocular geometry benchmarks.

What carries the argument

The point gradient matching loss, which supervises depth-normalized 3D finite differences, together with the Neighborhood Attention Decoder that performs local feature mixing during progressive upsampling.

If this is right

  • Local point map metrics rise consistently across benchmarks.
  • Point map normal evaluations improve alongside global AbsRel scores.
  • The model holds the top average rank on global point map accuracy in zero-shot settings.
  • The same components can be added to existing point map predictors to target local surface errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Better local geometry in raw point maps could reduce the need for post-processing steps in surface reconstruction pipelines.
  • The introduced normal metric could be adopted more widely to evaluate and compare local accuracy in future monocular 3D methods.
  • Neighborhood attention during decoding might transfer to other dense prediction tasks where local consistency matters.

Load-bearing premise

The measured gains in local point map and normal metrics come from the gradient matching loss and Neighborhood Attention Decoder rather than from differences in training schedule, data, or model capacity.

What would settle it

An experiment that trains the base architecture with the exact same schedule and data but without the gradient matching loss and NAD, then measures whether local point map and normal metrics still improve at the same rate.

Figures

Figures reproduced from arXiv: 2605.31577 by Bastian Leibe, Christian Schmidt, Daan de Geus, Gonzalo Martin Garcia, Ilya Fradlin, Karim Knaebel, Lucas Nunes.

Figure 1
Figure 1. Figure 1: Qualitative state-of-the-art comparison. SurGe predicts noticeably cleaner point maps. Preprint. arXiv:2605.31577v1 [cs.CV] 29 May 2026 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pointwise metrics only weakly capture local surface geometry. We add low- and high-frequency perturbations to the same ground-truth point map. AbsRelglob and AbsRelloc average pointwise position errors, giving nearly identical scores even though the high-frequency perturbation yields much less coherent local surface geometry. MAEnormal instead compares point map normals induced by neighboring point differe… view at source ↗
Figure 3
Figure 3. Figure 3: SurGe architecture overview. SurGe combines a DINOv2 [29] encoder with our Neigh￾borhood Attention Decoder (NAD). NAD upsamples encoder features through a sequence of stages ℓ ∈ {1, . . . , 5}, each built from nℓ NAD blocks. Compared to standard ViT [9] blocks, NAD blocks replace global self-attention with Neighborhood Attention [15], use window-matched RoPE [40] and only QK normalization [6] instead of pr… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative decoder ablation. Our NAD produces less warped geometry than a convolu￾tional decoder, visible in the chair legs and the wall to the right [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that current feedforward point-map methods produce inaccurate local surface geometry despite good global performance. It introduces a point-map normal metric to expose these errors, proposes a point gradient matching loss (supervising depth-normalized 3D finite differences) together with a Neighborhood Attention Decoder (NAD) for progressive upsampling and local feature mixing, and reports that the resulting SurGe model obtains the best average rank on global AbsRel across eight zero-shot monocular geometry benchmarks while also improving the local point-map and normal metrics.

Significance. If the reported gains can be shown to arise specifically from the gradient-matching loss and NAD rather than from uncontrolled differences in training schedule, augmentation, or capacity, the work would supply a concrete, modular improvement to local geometry fidelity in feedforward reconstruction pipelines and a new evaluation axis (point-map normals) that makes local surface errors more visible.

major comments (2)
  1. [Abstract / Experiments] The central attribution claim—that the observed improvements in local point-map and point-map normal metrics are produced by the point gradient matching loss and NAD—is not isolated from confounding variables. The abstract states that SurGe is compared against prior methods but supplies no information on whether those baselines were retrained under identical schedules, augmentations, or parameter counts; without such controls the performance delta cannot be confidently assigned to the two proposed components.
  2. [Experiments] No quantitative tables, ablation studies, or error analysis appear in the abstract, and the soundness assessment notes that the full manuscript must be checked for statistical significance, baseline matching, and metric definitions; if these are absent or incomplete in §4 or §5 the empirical support for the “consistently improves” claim is insufficient.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a single sentence or parenthetical reference to the key quantitative deltas or table numbers that support the “best average rank” and “consistently improves” statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract / Experiments] The central attribution claim—that the observed improvements in local point-map and point-map normal metrics are produced by the point gradient matching loss and NAD—is not isolated from confounding variables. The abstract states that SurGe is compared against prior methods but supplies no information on whether those baselines were retrained under identical schedules, augmentations, or parameter counts; without such controls the performance delta cannot be confidently assigned to the two proposed components.

    Authors: We agree the abstract provides no explicit information on baseline retraining. The full manuscript compares SurGe to the originally reported numbers from prior works, following standard practice for zero-shot monocular geometry benchmarks. However, Section 5 contains controlled ablations that fix training schedule, augmentations, and model capacity while varying only the gradient matching loss and NAD; these isolate the contributions of each component. We will revise the abstract to reference the ablation controls and add a brief statement on baseline evaluation protocol. revision: partial

  2. Referee: [Experiments] No quantitative tables, ablation studies, or error analysis appear in the abstract, and the soundness assessment notes that the full manuscript must be checked for statistical significance, baseline matching, and metric definitions; if these are absent or incomplete in §4 or §5 the empirical support for the “consistently improves” claim is insufficient.

    Authors: Quantitative tables, ablation studies, and error analysis are presented in Sections 4 and 5 of the full manuscript rather than the abstract, which is a high-level summary. Section 4 reports global AbsRel ranks and local point-map/normal metrics across the eight benchmarks; Section 5 provides ablations and qualitative error analysis. Metric definitions appear in Section 3, baselines follow the same evaluation protocol, and statistical significance is assessed via standard paired comparisons. No changes are required. revision: no

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks

full rationale

The paper introduces a point gradient matching loss and Neighborhood Attention Decoder, then reports empirical rankings on eight zero-shot monocular geometry benchmarks. No equations, derivations, or first-principles results appear in the abstract or described claims. Performance deltas are attributed to the proposed components via direct comparison against prior methods, with no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central result to its own inputs by construction. The evaluation chain is therefore self-contained against external data rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented physical entities are stated. The Neighborhood Attention Decoder is a new architectural module whose training relies on standard deep-learning assumptions not detailed here.

invented entities (1)
  • Neighborhood Attention Decoder (NAD) no independent evidence
    purpose: Progressively upsamples features and applies Neighborhood Attention for local feature mixing to improve surface geometry
    New decoder component introduced to address local geometry errors; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.1-grok · 5684 in / 1096 out tokens · 27959 ms · 2026-06-28T22:27:53.446536+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

  2. [2]

    Baruch, Z

    G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y . Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. InNeurIPS Datasets and Benchmarks Track, 2021

  3. [3]

    Bochkovskii, A

    A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y . Zhou, S. R. Richter, and V . Koltun. Depth Pro: Sharp monocular metric depth in less than a second. InICLR, 2025

  4. [4]

    D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. InECCV, 2012

  5. [5]

    Chambon, P

    L. Chambon, P. Couairon, E. Zablocki, A. Boulch, N. Thome, and M. Cord. Naf: Zero-shot feature upsampling via neighborhood attention filtering, 2025. URLhttps://arxiv.org/abs/2511.18452

  6. [6]

    Dehghani, J

    M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InICML, 2023

  7. [7]

    InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kemb- havi, and A. Farhadi. Objaverse: A universe of annotated 3d objects.arXiv preprint arXiv:2212.08051, 2022

  8. [8]

    Z. Ding, Y . Zhang, C. Zhu, G. Zhang, X. Li, N. Jiang, Y . Que, Y . Peng, and X. Guan. Cat-unet: An enhanced u-net architecture with coordinate attention and skip-neighborhood attention transformer for medical image segmentation.Information Sciences, 2024

  9. [9]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Min- derer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021

  10. [10]

    Downs, A

    L. Downs, A. Francis, N. Koenig, B. Kinman, R. Hickman, K. Reymann, T. B. McHugh, and V . Vanhoucke. Google scanned objects: A high-quality dataset of 3D scanned household items. InICRA, 2022

  11. [11]

    Fonder and M

    M. Fonder and M. V . Droogenbroeck. Mid-air: A multi-modal dataset for extremely low altitude drone flights. InCVPR Workshops, 2019. 10

  12. [12]

    Guizilini, R

    V . Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon. 3D packing for self-supervised monocular depth estimation. InCVPR, 2020

  13. [13]

    J. L. Gómez, M. Silva, A. Seoane, A. Borràs, M. Noriega, G. Ros, J. A. Iglesias-Guitian, and A. M. López. All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes. Neurocomputing, 2025

  14. [14]

    Hassani and H

    A. Hassani and H. Shi. Dilated neighborhood attention transformer.arXiv preprint arXiv:2209.15001, 2022

  15. [15]

    Hassani, S

    A. Hassani, S. Walton, J. Li, S. Li, and H. Shi. Neighborhood attention transformer. InCVPR, 2023

  16. [16]

    Hassani, W.-M

    A. Hassani, W.-M. Hwu, and H. Shi. Faster neighborhood attention: Reducing the O(n2) cost of self attention at the threadblock level. InAdvances in Neural Information Processing Systems, 2024

  17. [17]

    Hassani, F

    A. Hassani, F. Zhou, A. Kane, J. Huang, C.-Y . Chen, M. Shi, S. Walton, M. Hoehnerbach, V . Thakkar, M. Isaev, et al. Generalized neighborhood attention: Multi-dimensional sparse attention at the speed of light.arXiv preprint arXiv:2504.16922, 2025

  18. [18]

    Hernandez-Juarez, L

    D. Hernandez-Juarez, L. Schneider, A. Espinosa, D. Vazquez, A. M. Lopez, U. Franke, M. Pollefeys, and J. C. Moure. Slanted stixels: Representing san francisco’s steepest streets. InBMVC, 2017

  19. [19]

    Huang, K

    P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. Deepmvs: Learning multi-view stereopsis. InCVPR, 2018

  20. [20]

    Keetha, N

    N. Keetha, N. Müller, J. Schönberger, L. Porzi, Y . Zhang, T. Fischer, A. Knapitsch, D. Zauss, E. Weber, N. Antunes, J. Luiten, M. Lopez-Antequera, S. R. Bulò, C. Richardt, D. Ramanan, S. Scherer, and P. Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction. In3DV, 2026

  21. [21]

    T. Koch, L. Liebel, F. Fraundorfer, and M. Körner. Evaluation of CNN-based single-image depth estimation methods. InECCV Workshops, 2018

  22. [22]

    Y . Li, L. Jiang, L. Xu, Y . Xiangli, Z. Wang, D. Lin, and B. Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. InICCV, 2023

  23. [23]

    Li and N

    Z. Li and N. Snavely. MegaDepth: Learning single-view depth prediction from internet photos. InCVPR, 2018

  24. [24]

    H. Lin, S. Chen, J. H. Liew, D. Y . Chen, Z. Li, G. Shi, J. Feng, and B. Kang. Depth Anything 3: recovering the visual space from any views. InICLR, 2026

  25. [25]

    H. Liu, B. Li, C. Liu, and M. Lu. Dinat-ir: Exploring dilated neighborhood attention for high-quality image restoration.arXiv preprint arXiv:2507.17892, 2025

  26. [26]

    Loshchilov and F

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. InICLR, 2019

  27. [27]

    Niklaus, L

    S. Niklaus, L. Mai, J. Yang, and F. Liu. 3d ken burns effect from a single image.ACM TOG, 2019

  28. [28]

    Odena, V

    A. Odena, V . Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts.Distill, 2016

  29. [29]

    Oquab, T

    M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without s...

  30. [30]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InICCV, 2023

  31. [31]

    Piccinelli, C

    L. Piccinelli, C. Sakaridis, Y .-H. Yang, M. Segu, S. Li, W. Abbeloos, and L. V . Gool. UniDepthV2: Universal monocular metric depth estimation made simpler.IEEE TPAMI, 2026

  32. [32]

    L. Qiu, G. Chen, X. Gu, Q. Zuo, M. Xu, Y . Wu, W. Yuan, Z. Dong, L. Bo, and X. Han. Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d. InCVPR, 2024

  33. [33]

    Ranftl, A

    R. Ranftl, A. Bochkovskiy, and V . Koltun. Vision transformers for dense prediction. InICCV, 2021

  34. [34]

    Ranftl, K

    R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V . Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE TPAMI, 2022

  35. [35]

    Roberts, J

    M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. InICCV, 2021. 11

  36. [36]

    Saadati, O

    D. Saadati, O. N. Manzari, and S. Mirzakuchaki. Dilated-unet: A fast and accurate medical image segmentation approach using a dilated transformer and u-net architecture.arXiv preprint arXiv:2304.11450, 2023

  37. [37]

    Schöps, J

    T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. InCVPR, 2017

  38. [38]

    W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016

  39. [39]

    Silberman, D

    N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. InECCV, 2012

  40. [40]

    J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 2024

  41. [41]

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, V . Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y . Zhang, J. Shlens, Z. Chen, and D. Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. InCVPR, 2020

  42. [42]

    F. Tosi, Y . Liao, C. Schmitt, and A. Geiger. SMD-Nets: Stereo mixture density networks. InCVPR, 2021

  43. [43]

    Touvron, M

    H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou. Going deeper with image transformers. InICCV, 2021

  44. [44]

    Uhrig, N

    J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant CNNs. In3DV, 2017

  45. [45]

    Diode: A dense indoor and outdoor depth dataset.arXiv preprint arXiv:1908.00463, 2019

    I. Vasiljevic, N. I. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter, and G. Shakhnarovich. DIODE: A dense indoor and outdoor DEpth dataset.CoRR, abs/1908.00463, 2019

  46. [46]

    Vaswani, N

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. InNeurIPS, 2017

  47. [47]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. VGGT: Visual geometry grounded transformer. InCVPR, 2025

  48. [48]

    Wang and S

    K. Wang and S. Shen. Flow-motion and depth network for monocular stereo and beyond.IEEE Robotics and Automation Letters, 2020

  49. [49]

    Q. Wang, S. Zheng, Q. Yan, F. Deng, K. Zhao, and X. Chu. Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation. InICME, 2021

  50. [50]

    R. Wang, S. Xu, C. Dai, J. Xiang, Y . Deng, X. Tong, and J. Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InCVPR, 2025

  51. [51]

    R. Wang, S. Xu, Y . Dong, Y . Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. InNeurIPS, 2025

  52. [52]

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud. DUSt3R: Geometric 3d vision made easy. In CVPR, 2024

  53. [53]

    W. Wang, D. Zhu, X. Wang, Y . Hu, Y . Qiu, C. Wang, Y . Hu, A. Kapoor, and S. Scherer. Tartanair: A dataset to push the limits of visual slam. InIROS, 2020

  54. [54]

    Y . Wang, J. Zhou, H. Zhu, W. Chang, Y . Zhou, Z. Li, J. Chen, J. Pang, C. Shen, and T. He.π3: Permutation- equivariant visual geometry learning. InICLR, 2026

  55. [55]

    Weinzaepfel, V

    P. Weinzaepfel, V . Leroy, T. Lucas, R. Brégier, Y . Cabon, V . Arora, L. Antsfeld, B. Chidlovskii, G. Csurka, and J. Revaud. Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion.NeurIPS, 2022

  56. [56]

    Weinzaepfel, T

    P. Weinzaepfel, T. Lucas, V . Leroy, Y . Cabon, V . Arora, R. Brégier, G. Csurka, L. Antsfeld, B. Chidlovskii, and J. Revaud. Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow. InICCV, 2023. 12

  57. [57]

    Wilson, W

    B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, P. Carr, and J. Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. InNeurIPS Datasets and Benchmarks Track, 2021

  58. [58]

    G. Xu, H. Lin, H. Luo, X. Wang, J. Yao, L. Zhu, Y . Pu, C. Chi, H. Sun, B. Wang, et al. Pixel-perfect depth with semantics-prompted diffusion transformers.arXiv preprint arXiv:2510.07316, 2025

  59. [59]

    J. Yang, A. Sax, K. J. Liang, M. Henaff, H. Tang, A. Cao, J. Chai, F. Meier, and M. Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. InCVPR, 2025

  60. [60]

    L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InCVPR, 2024

  61. [61]

    L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao. Depth anything v2. InNeurIPS, 2024

  62. [62]

    Y . Yao, Z. Luo, S. Li, J. Zhang, Y . Ren, L. Zhou, T. Fang, and L. Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. InCVPR, 2020

  63. [63]

    H. Yu, H. Lin, J. Wang, J. Li, Y . Wang, X. Zhang, Y . Wang, X. Zhou, R. Hu, and S. Peng. InfiniDepth: Arbitrary-resolution and fine-grained depth estimation with neural implicit fields. InCVPR, 2026

  64. [64]

    A. R. Zamir, A. Sax, W. B. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. InCVPR, 2018

  65. [65]

    X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. InCVPR, 2022

  66. [66]

    Zhang and R

    B. Zhang and R. Sennrich. Root mean square layer normalization. InNeurIPS, 2019

  67. [67]

    Zheng, J

    J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou. Structured3D: A large photo-realistic dataset for structured 3d modeling. InECCV, 2020

  68. [68]

    J. Zhu, X. Chen, K. He, Y . LeCun, and Z. Liu. Transformers without normalization. InCVPR, 2025

  69. [69]

    Zolfaghari Bengar, A

    J. Zolfaghari Bengar, A. Gonzalez-Garcia, G. Villalonga, B. Raducanu, H. H. Aghdam, M. Mozerov, A. M. Lopez, and J. van de Weijer. Temporal coherence for active learning in videos. InICCV Workshops, 2019. 13 Appendix A Details on the Point Gradient Matching Loss Pseudocode A gives a Python-like specification ofLpgm in Eq. (3). The loss optimizes the orien...