SS3D: End2End Self-Supervised 3D from Web Videos
Pith reviewed 2026-05-14 21:23 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
Pretraining a single feed-forward network on filtered web videos enables joint monocular estimation of depth, ego-motion, and intrinsics with strong zero-shot transfer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining a multi-view signal proxy for video filtering and curriculum sampling with an intrinsics-first training schedule, a feed-forward model can be trained end-to-end on web-scale video to predict depth, ego-motion, and intrinsics together; the resulting checkpoint exhibits strong cross-domain zero-shot performance and outperforms prior self-supervised baselines after fine-tuning.
What carries the argument
The multi-view signal proxy (MVS), which filters unconstrained web videos and performs curriculum sampling to supply stable SfM supervision signals despite weak multi-view observability.
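A minimal sketch of what MVS-based filtering and curriculum sampling could look like in practice, assuming each clip receives a scalar proxy score and that higher scores mean stronger multi-view signal. The Clip container, the mvs_score callable, and the keep_threshold value are illustrative placeholders, not details taken from the paper.

```python
# Hypothetical sketch of MVS-based filtering and curriculum ordering.
# `mvs_score`, `keep_threshold`, and the easiest-first ordering are
# assumptions for illustration, not values from the paper.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    video_id: str
    frames: list                 # decoded frames (or frame paths)
    mvs: float = 0.0             # multi-view signal proxy score, filled in below


def filter_and_order(clips: List[Clip],
                     mvs_score: Callable[[Clip], float],
                     keep_threshold: float = 0.5) -> List[Clip]:
    """Score every clip, drop those with weak multi-view signal, and order
    the remainder so training sees the strongest-signal clips first."""
    for clip in clips:
        clip.mvs = mvs_score(clip)
    kept = [c for c in clips if c.mvs >= keep_threshold]
    return sorted(kept, key=lambda c: c.mvs, reverse=True)  # curriculum: easy first
```

In this reading the curriculum is simply a sort by proxy score; the paper's actual sampling schedule may interleave difficulty levels or re-weight clips rather than order them strictly.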
If this is right
- A single checkpoint can be used for coherent end-to-end 3D estimation without separate per-task models or post-processing steps.
- Zero-shot transfer to new video domains becomes competitive with supervised methods trained on those domains.
- Fine-tuning on limited labeled data yields higher accuracy than fine-tuning from prior self-supervised checkpoints.
- Joint prediction of depth, ego-motion, and intrinsics remains stable under a unified evaluation protocol.
- The released pretrained model can serve as a drop-in starting point for downstream monocular 3D tasks.
Where Pith is reading between the lines
- The same filtering and curriculum logic could be applied to other large unlabeled video sources such as social media or surveillance archives to further scale pretraining.
- The learned 3D priors might transfer to related tasks like visual odometry or novel-view synthesis without additional supervision.
- If the MVS proxy generalizes, future work could test whether even larger corpora produce monotonic gains in cross-domain robustness.
- Robotics and AR systems could adopt the released checkpoint for real-time monocular 3D without domain-specific retraining.
Load-bearing premise
The multi-view signal proxy can consistently select and order web videos so that they supply enough consistent multi-view geometry for stable self-supervised training.
What would settle it
Run the same model and training schedule on the unfiltered YouTube-8M corpus without MVS selection and measure whether zero-shot depth and ego-motion accuracy on held-out domains falls to the level of prior self-supervised baselines or training diverges.
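Such a comparison would typically be scored with the standard monocular-depth metrics used across the self-supervised depth literature. Below is a minimal sketch, assuming per-image median scaling for scale-ambiguous predictions; the depth caps and valid-pixel handling are illustrative, not the paper's stated protocol.

```python
# Standard per-image depth metrics (AbsRel, delta < 1.25) with median
# scaling, a common evaluation for self-supervised monocular depth.
# The depth range [1e-3, 80] m is a typical outdoor choice, assumed here.
import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray,
                  min_d: float = 1e-3, max_d: float = 80.0):
    """Return (AbsRel, delta<1.25) for one pair of dense depth maps."""
    valid = (gt > min_d) & (gt < max_d)
    pred, gt = pred[valid], gt[valid]
    pred = pred * np.median(gt) / np.median(pred)   # median scaling
    pred = np.clip(pred, min_d, max_d)
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    delta1 = float(np.mean(np.maximum(pred / gt, gt / pred) < 1.25))
    return abs_rel, delta1
```

Running the same harness on a checkpoint trained with MVS filtering and on one trained on the unfiltered corpus would make the proposed comparison concrete.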
Original abstract
We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
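To make the abstract's intrinsics-first two-stage schedule concrete, here is a hedged sketch of one way such a schedule could be wired up. The head layouts, the choice to update only the backbone and intrinsics head in stage one, and the self_sup_loss callable are assumptions for illustration; the paper's actual architecture and losses are not reproduced here.

```python
# Hedged sketch: a single feed-forward model with depth, ego-motion, and
# intrinsics heads, trained with an intrinsics-first two-stage schedule.
# Head shapes and staging details are illustrative assumptions.
import torch
import torch.nn as nn


class SS3DStudent(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 256):
        super().__init__()
        self.backbone = backbone                        # shared video encoder
        self.depth_head = nn.Linear(feat_dim, 1)        # stand-in for a dense depth decoder
        self.pose_head = nn.Linear(feat_dim, 6)         # ego-motion: axis-angle + translation
        self.intrinsics_head = nn.Linear(feat_dim, 4)   # fx, fy, cx, cy

    def forward(self, frames: torch.Tensor):
        feat = self.backbone(frames)
        return self.depth_head(feat), self.pose_head(feat), self.intrinsics_head(feat)


def train_two_stage(model, loader, self_sup_loss, steps_stage1, steps_stage2):
    # Stage 1 (intrinsics first): only the backbone and intrinsics head get gradients.
    opt1 = torch.optim.Adam(list(model.backbone.parameters())
                            + list(model.intrinsics_head.parameters()), lr=1e-4)
    # Stage 2: joint optimization of all three heads.
    opt2 = torch.optim.Adam(model.parameters(), lr=1e-4)
    for stage, (opt, steps) in enumerate([(opt1, steps_stage1), (opt2, steps_stage2)], 1):
        for _, batch in zip(range(steps), loader):
            depth, pose, intr = model(batch)
            loss = self_sup_loss(batch, depth, pose, intr, stage=stage)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

A real implementation would use a dense depth decoder and couple the photometric reconstruction loss to the predicted pose and intrinsics; this sketch only shows the staging logic.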
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SS3D, a self-supervised pretraining pipeline for end-to-end monocular 3D estimation that jointly predicts depth, ego-motion, and camera intrinsics from video. It scales SfM supervision to ~100M frames from filtered YouTube-8M web videos using a multi-view signal proxy (MVS) for filtering and curriculum sampling, an intrinsics-first two-stage training schedule, and expert distillation into a single student model. The central claim is that this yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines, with code and checkpoint released.
Significance. If the empirical claims hold under standard controls, the work would demonstrate a practical route to web-scale self-supervision for feed-forward 3D models, reducing dependence on curated multi-view datasets and improving generalization. The release of reproducible code and a single checkpoint is a clear strength that supports follow-up research.
major comments (3)
- [Method (MVS proxy description)] The MVS filtering and curriculum-sampling procedure (described in the method section) is load-bearing for the scaling claim, yet the manuscript provides no quantitative validation that MVS scores correlate with downstream SfM stability metrics such as successful reconstruction rate, median reprojection error, or depth consistency on held-out web videos. Without an ablation comparing filtered vs. unfiltered YouTube-8M subsets or a correlation plot, it remains possible that reported gains arise primarily from dataset scale rather than the proposed proxy.
- [Training schedule and distillation] The two-stage intrinsics schedule and expert-distillation step presuppose that the MVS-filtered signal is already sufficiently clean for joint end-to-end optimization. The paper should report an ablation (e.g., in the experiments section) that isolates the contribution of each component by training a single-stage baseline on the same filtered corpus and measuring zero-shot transfer degradation.
- [Experiments (zero-shot evaluation)] The zero-shot cross-domain transfer results rely on a unified single-checkpoint evaluation protocol, but the manuscript does not detail how domain-specific intrinsics or motion statistics are handled at test time. A concrete failure case or per-domain breakdown would be needed to confirm that the reported improvements are not artifacts of evaluation protocol differences with prior baselines.
minor comments (2)
- [Method] Notation for the MVS proxy score should be defined explicitly with an equation rather than described only in prose, to allow readers to reproduce the filtering thresholds.
- [Experiments] The abstract states '~100M frames after filtering'; the exact filtering ratio and final dataset statistics should appear in a table in the experiments section for transparency.
Simulated Author's Rebuttal
We thank the referee for their detailed review and positive assessment of the work's potential impact. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to include additional ablations, clarifications, and analyses to strengthen the claims.
Point-by-point responses
Referee: The MVS filtering and curriculum-sampling procedure (described in the method section) is load-bearing for the scaling claim, yet the manuscript provides no quantitative validation that MVS scores correlate with downstream SfM stability metrics such as successful reconstruction rate, median reprojection error, or depth consistency on held-out web videos. Without an ablation comparing filtered vs. unfiltered YouTube-8M subsets or a correlation plot, it remains possible that reported gains arise primarily from dataset scale rather than the proposed proxy.
Authors: We agree that quantitative validation of the MVS proxy would strengthen the method section. In the revised manuscript, we include a new subsection with correlation analysis between MVS scores and SfM reconstruction metrics on held-out videos, as well as a performance comparison on filtered versus unfiltered data subsets. These additions show that the filtering improves stability beyond mere scale (a minimal sketch of such a correlation check follows these responses). revision: yes
Referee: The two-stage intrinsics schedule and expert-distillation step presuppose that the MVS-filtered signal is already sufficiently clean for joint end-to-end optimization. The paper should report an ablation (e.g., in the experiments section) that isolates the contribution of each component by training a single-stage baseline on the same filtered corpus and measuring zero-shot transfer degradation.
Authors: We acknowledge the need for isolating the contributions of the training components. We have added an ablation study in the experiments section comparing the full two-stage schedule with distillation against a single-stage baseline trained on the same filtered corpus. The results confirm the benefits of each component for zero-shot transfer performance. revision: yes
Referee: The zero-shot cross-domain transfer results rely on a unified single-checkpoint evaluation protocol, but the manuscript does not detail how domain-specific intrinsics or motion statistics are handled at test time. A concrete failure case or per-domain breakdown would be needed to confirm that the reported improvements are not artifacts of evaluation protocol differences with prior baselines.
Authors: We clarify that at test time, the model relies on its own predicted intrinsics and ego-motion without domain-specific priors. In the revised manuscript, we have added a per-domain breakdown of the zero-shot results and included a discussion of representative failure cases to demonstrate the robustness of the unified evaluation protocol. revision: yes
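Both the first major comment and the authors' response turn on whether MVS scores actually track downstream SfM stability. Here is a minimal, hypothetical sketch of such a correlation check, assuming per-clip MVS scores and two stability signals (reconstruction success and median reprojection error) are already available; the metric names and the choice of Spearman rank correlation are assumptions, not the paper's protocol.

```python
# Hypothetical correlation check: do MVS proxy scores rank clips the same
# way as downstream SfM stability metrics? Inputs are assumed to be
# per-clip arrays computed elsewhere.
import numpy as np
from scipy.stats import spearmanr


def mvs_vs_sfm_correlation(mvs_scores, sfm_success_rate, median_reproj_error):
    mvs = np.asarray(mvs_scores, dtype=float)
    rho_success, p_success = spearmanr(mvs, np.asarray(sfm_success_rate, dtype=float))
    rho_error, p_error = spearmanr(mvs, np.asarray(median_reproj_error, dtype=float))
    # A useful proxy would show positive rank correlation with success and
    # negative correlation with reprojection error.
    return {"success": (rho_success, p_success), "reproj_error": (rho_error, p_error)}
```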
Circularity Check
No circularity: empirical pipeline relies on external SfM and verifiable pretraining results.
full rationale
The paper describes an end-to-end self-supervised pipeline that filters web videos via an MVS proxy, applies SfM-based supervision, and reports empirical zero-shot transfer and fine-tuning gains on YouTube-8M. No equation or claim reduces a reported prediction to a fitted parameter or self-citation by construction; the method uses external SfM tools, releases code and checkpoints, and evaluates on held-out domains. The central claims are therefore falsifiable against independent benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- MVS filtering and curriculum thresholds
axioms (1)
- (domain assumption) Structure-from-motion provides reliable depth and ego-motion signals even on unconstrained web video after filtering
invented entities (1)
- multi-view signal proxy (MVS): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "Photometric Self-Supervision. The reconstruction loss ... ρ(·) is a robust penalty (Charbonnier + SSIM). ... multi-view objective Ψ = Σ Ψ_{i→j}"
- IndisputableMonolith/Foundation/BranchSelection.lean · RCLCombiner_isCoupling_iff · tag: unclear
  unclear: relation between the paper passage and the cited Recognition theorem (see the tag glossary below).
  Paper passage: "Multi-View Signal Proxy (MVS) ... P_{t,t+1} = r_H / r_F ... MVS(v) = average P_{t,t+1}"
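The quoted MVS passage gives the pair score as P_{t,t+1} = r_H / r_F but does not define the two ratios here. Below is a hedged sketch of one plausible reading, assuming r_H and r_F are RANSAC inlier ratios of a homography and a fundamental matrix fitted to the same feature matches; the ORB features, the match-count guard, and the thresholds are illustrative, and the paper's actual definition and orientation of the score may differ.

```python
# Hypothetical reading of the MVS pair score P_{t,t+1} = r_H / r_F:
# ratio of homography inliers to fundamental-matrix inliers over shared
# ORB matches. All choices below (features, thresholds) are assumptions.
import cv2
import numpy as np


def pair_score(img_t: np.ndarray, img_t1: np.ndarray) -> float:
    """Score one consecutive frame pair (grayscale uint8 images)."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_t, None)
    kp2, des2 = orb.detectAndCompute(img_t1, None)
    if des1 is None or des2 is None:
        return 0.0
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    if len(matches) < 8:
        return 0.0
    p1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    p2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    _, mask_h = cv2.findHomography(p1, p2, cv2.RANSAC, 3.0)
    _, mask_f = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 3.0, 0.99)
    if mask_h is None or mask_f is None:
        return 0.0
    r_h = float(mask_h.sum()) / len(matches)
    r_f = float(mask_f.sum()) / len(matches)
    return r_h / r_f if r_f > 0 else 0.0


def mvs_video_score(frames) -> float:
    """MVS(v): average pair score over consecutive frames of a clip."""
    scores = [pair_score(a, b) for a, b in zip(frames[:-1], frames[1:])]
    return float(np.mean(scores)) if scores else 0.0
```

Under this reading a lower ratio suggests parallax that a single homography cannot explain, so a filtering rule might threshold or invert the score; the quoted passage does not settle which orientation counts as a strong multi-view signal.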
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.