Introduction to Camera Pose Estimation with Deep Learning

Ron Ferens; Yoli Shavit

arxiv: 1907.05272 · v3 · pith:VITHFGVBnew · submitted 2019-07-08 · 💻 cs.CV

Pith reviewed 2026-05-25 01:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords camera pose estimation, deep learning, pose regression, RGB images, visual localization, computer vision, learning-based methods, reproducibility

The pith

Deep learning for camera pose estimation started with direct RGB regression and has since produced identifiable trends plus comparable implementations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews the body of deep learning work on regressing absolute camera pose from single RGB images. It begins with the first regression networks, which underperformed classic feature-based methods yet prompted many follow-on papers. The authors catalog key techniques, isolate recurring strategies meant to raise accuracy, and present a side-by-side comparison of published estimators together with notes on running them. They close by noting newer directions and open questions.

Core claim

Although the initial deep convolutional regression of camera pose from RGB images produced lower accuracy than established feature-based pipelines, it initiated a wave of learning-based estimators. The review catalogs these methods, identifies the main directions taken to improve the original regression, supplies a cross-comparison with reproducibility details, and outlines emerging approaches.

What carries the argument

Deep pose regression from RGB images, treated as the baseline whose limitations subsequent methods address through specific trends.
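
To make the baseline concrete, the sketch below shows two loss forms commonly associated with deep pose regression: a fixed-weight sum of position and orientation errors, and a variant in which the two weights are learned. The formulas and the names `beta`, `s_x`, and `s_q` are assumptions taken from the general pose regression literature; they are not stated in the summary above.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def unit(q):
    """Scale a quaternion to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]


def weighted_pose_loss(x_true, q_true, x_pred, q_pred, beta):
    """Fixed-weight loss: position error plus beta times orientation error.

    beta is a hand-tuned hyperparameter (assumed, scene dependent) that
    balances metres against quaternion units.
    """
    return l2(x_pred, x_true) + beta * l2(q_pred, unit(q_true))


def learned_weight_pose_loss(x_true, q_true, x_pred, q_pred, s_x, s_q):
    """Learned-weight variant: s_x and s_q are trainable scalars (assumed).

    Each term is scaled by exp(-s), and s itself is added as a regulariser
    so the weights cannot grow without bound.
    """
    loss_x = l2(x_pred, x_true)
    loss_q = l2(q_pred, unit(q_true))
    return loss_x * math.exp(-s_x) + s_x + loss_q * math.exp(-s_q) + s_q
```

With `s_x = s_q = 0` the learned-weight form reduces to the plain sum of the two errors, which is a convenient sanity check when wiring it into a training loop.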

If this is right

Practitioners can consult the cross-comparison to select an estimator suited to their accuracy and runtime needs.
The supplied execution notes lower the barrier to reproducing reported results.
Identified trends such as geometric constraints or multi-task training indicate concrete routes for further accuracy gains.
Discussion of emerging solutions frames immediate next steps for hybrid learning-plus-geometry pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Continued progress along the observed trends could make single-image pose regression competitive with structure-from-motion pipelines in many indoor settings.
The review implicitly shows that transfer from large generic image datasets is a practical way to bootstrap pose estimation when labeled camera data are scarce.
If the reproducibility notes prove sufficient, the field may shift from publishing isolated accuracy numbers toward standardized public implementations.

Load-bearing premise

The first deep pose regression paper generated enough follow-up work to justify a coherent review and cross-comparison at this time.

What would settle it

A controlled benchmark in which none of the reviewed learning-based estimators show measurable accuracy gains over the original regression network or over classic feature-based solutions.

Figures

Figures reproduced from arXiv: 1907.05272 by Ron Ferens, Yoli Shavit.

**Figure 1.** A schematization of PoseNet’s architecture. Given an image 𝐼𝑐, a dCNN architecture (‘Encoder’) generates visual feature vectors from 𝐼𝑐. Using an FC layer (‘Localizer’), the visual encoding of 𝐼𝑐 is mapped to a localization feature vector. Finally, two separate connected layers (‘Regressor’) are used to regress 𝑥̂ and 𝑞̂, respectively, giving the estimated pose 𝑝̂ = (𝑥̂, 𝑞̂). A similar abstraction wa…

**Figure 3.** Example modifications to PoseNet’s architecture. Auxiliary Learning Loss and architecture modifications to PoseNet’s solution led to a significant improvement in its pose error for indoor and outdoor scenes (…
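
The data flow in the Figure 1 caption (encoding, then a localizer layer, then two separate regressors) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the encoder is left out and replaced by a placeholder feature vector, the layer sizes, random initialisation, ReLU activation, and the class name `PoseHead` are all assumptions, and normalising the predicted quaternion to unit length is a common convention rather than something the caption states.

```python
import math
import random


def linear(weights, bias, x):
    """Fully connected layer y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def normalize(q):
    """Scale q to unit length; fall back to the identity rotation if q is zero."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        return [1.0, 0.0, 0.0, 0.0]
    return [c / norm for c in q]


class PoseHead:
    """Localizer plus two regressors applied to an encoder output."""

    def __init__(self, encoding_dim, localization_dim, seed=0):
        rng = random.Random(seed)
        self.localizer = init_layer(localization_dim, encoding_dim, rng)
        self.position_regressor = init_layer(3, localization_dim, rng)
        self.orientation_regressor = init_layer(4, localization_dim, rng)

    def forward(self, encoding):
        # Localizer: visual encoding -> localization feature vector.
        loc = relu(linear(*self.localizer, encoding))
        # Two separate regressors: 3-D position and 4-D quaternion.
        x_hat = linear(*self.position_regressor, loc)
        q_hat = normalize(linear(*self.orientation_regressor, loc))
        return x_hat, q_hat


# Placeholder encoding standing in for the output of a dCNN encoder.
head = PoseHead(encoding_dim=8, localization_dim=16)
x_hat, q_hat = head.forward([0.5] * 8)
```

Keeping the two regressors separate mirrors the caption: position and orientation share the localization features but are predicted by independent output layers.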

Original abstract

Over the last two decades, deep learning has transformed the field of computer vision. Deep convolutional networks were successfully applied to learn different vision tasks such as image classification, image segmentation, object detection and many more. By transferring the knowledge learned by deep models on large generic datasets, researchers were further able to create fine-tuned models for other more specific tasks. Recently this idea was applied for regressing the absolute camera pose from an RGB image. Although the resulting accuracy was sub-optimal, compared to classic feature-based solutions, this effort led to a surge of learning-based pose estimation methods. Here, we review deep learning approaches for camera pose estimation. We describe key methods in the field and identify trends aiming at improving the original deep pose regression solution. We further provide an extensive cross-comparison of existing learning-based pose estimators, together with practical notes on their execution for reproducibility purposes. Finally, we discuss emerging solutions and potential future research directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A clear but limited survey that maps DL camera pose methods and flags trends, with the main weakness being an unnormalized cross-comparison table.

This paper is a literature survey on deep learning for absolute camera pose regression from single RGB images. It starts with the 2015 PoseNet paper, walks through later attempts to improve accuracy via better losses, scene coordinate regression, relative pose, and hybrid classical-learning pipelines, and ends with a table of reported numbers plus some code-running notes. That organization is the main value: someone new to the area can get a quick sense of the main branches without reading twenty separate papers. The reproducibility notes are also a small but concrete plus. The cross-comparison table is the soft spot. The authors pull published figures rather than re-running the methods on one fixed benchmark, so differences in dataset (7-Scenes vs. Cambridge), resolution, quaternion vs. other representations, and absolute vs. relative formulation are not controlled. The stress-test note is accurate here; you cannot read clean trends off those numbers without those caveats being stated more explicitly. No new derivations or experiments appear, which is expected for a survey but means the contribution is entirely in curation and summary. The writing stays within its stated scope and does not overclaim. This is the kind of paper that helps a graduate student or practitioner get oriented, not one that moves the technical frontier. It is coherent on its own terms and cites the right prior work, so it is worth sending to referees rather than desk-rejecting, provided the comparison limitations are made plain in revision.

Referee Report

1 major / 0 minor

Summary. The manuscript is a survey of deep learning methods for camera pose estimation. It reviews key approaches starting from the initial deep pose regression work, identifies trends aimed at improving accuracy, supplies an extensive cross-comparison of published learning-based estimators together with reproducibility notes, and outlines emerging solutions and future directions.

Significance. A survey that successfully normalizes and interprets results across papers could help consolidate the literature on learning-based pose estimation and highlight reproducible practices; the current version's value is limited by the comparability issues in its central comparison.

major comments (1)

[Cross-comparison section (as described in abstract)] The headline claim of an 'extensive cross-comparison' (abstract) rests on tabulated published numbers rather than re-evaluations on a common benchmark, training protocol, and test split. Pose regression accuracy is known to vary with dataset (7-Scenes vs. Cambridge Landmarks), resolution, quaternion vs. log-map representation, and absolute vs. relative regression; without explicit normalization or flags for non-comparable entries, trends cannot be reliably read from the table.
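
Two of the factors named in this comment, the rotation representation and the error metric, can be made concrete. The following is a minimal sketch, assuming quaternions in (w, x, y, z) order and the position and orientation errors customarily reported for pose regression; the function names are illustrative and do not come from the paper.

```python
import math


def normalize(q):
    """Scale a quaternion to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def position_error(x_true, x_pred):
    """Euclidean distance between true and predicted camera positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_true, x_pred)))


def orientation_error_deg(q_true, q_pred):
    """Angle in degrees between two rotations given as quaternions.

    The absolute value handles the double cover: q and -q encode the
    same rotation and should give zero error.
    """
    q1, q2 = normalize(q_true), normalize(q_pred)
    d = min(1.0, abs(sum(a * b for a, b in zip(q1, q2))))
    return math.degrees(2.0 * math.acos(d))


def quaternion_log(q):
    """Log map of a unit quaternion (w, x, y, z) to a 3-vector.

    The result is the rotation axis scaled by half the rotation angle,
    a three-parameter alternative to regressing four quaternion values.
    """
    w, x, y, z = normalize(q)
    if w < 0.0:  # pick the representative with non-negative scalar part
        w, x, y, z = -w, -x, -y, -z
    v_norm = math.sqrt(x * x + y * y + z * z)
    if v_norm < 1e-12:
        return (0.0, 0.0, 0.0)
    scale = math.atan2(v_norm, w) / v_norm
    return (x * scale, y * scale, z * scale)
```

Numbers reported under different representations or metrics conventions would need to be converted to a shared form like this before they can be placed side by side.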

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and constructive feedback on our survey manuscript. We address the single major comment point-by-point below, and we plan to incorporate clarifications in a revised version.

Referee: [Cross-comparison section (as described in abstract)] The headline claim of an 'extensive cross-comparison' (abstract) rests on tabulated published numbers rather than re-evaluations on a common benchmark, training protocol, and test split. Pose regression accuracy is known to vary with dataset (7-Scenes vs. Cambridge Landmarks), resolution, quaternion vs. log-map representation, and absolute vs. relative regression; without explicit normalization or flags for non-comparable entries, trends cannot be reliably read from the table.

Authors: We agree that the tabulated results reflect published numbers under varying experimental conditions rather than a unified re-evaluation, and that factors such as dataset choice, pose representation, and regression type affect direct comparability. As this is a survey paper whose primary aim is to review the literature and identify trends, compiling reported results follows standard practice for such works; performing a full re-implementation and re-training of every method on identical protocols would constitute a separate large-scale experimental study beyond the scope of a survey. Nevertheless, the concern is valid, and we will revise the cross-comparison section to add explicit flags, footnotes, and an expanded discussion that clearly delineate non-comparable entries and the known sources of variation. This will allow readers to interpret the table more cautiously while preserving the overview value of the compilation.

Revision: yes

Circularity Check

0 steps flagged

Survey paper: no derivations, predictions, or fitted quantities present

This is a literature review surveying deep learning methods for camera pose estimation. It describes existing approaches, identifies trends, and tabulates reported results from prior work. No original derivations, first-principles predictions, parameter fitting, or mathematical claims are made that could reduce to self-definition or self-citation. The cross-comparison consists of collected published numbers rather than new fitted outputs, so no circular reduction applies. The paper is self-contained as a descriptive survey against external literature.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper with no new derivations, parameters, or entities. It summarizes existing work on transfer learning and pose regression.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 10 internal anchors

[1]

and van der Maaten, L., 2018

Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A. and van der Maaten, L., 2018. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 181-196)

work page 2018
[2]

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Tan, M. and Le, Q.V., 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946

work page internal anchor Pith review Pith/arXiv arXiv 2019
[3]

and Rabinovich, A.,

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A.,

work page
[4]

In Proceedings of the IEEE conference on computer vision and pattern recognition (pp

Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9)

work page
[5]

and Hall, P., 2016, October

Westlake, N., Cai, H. and Hall, P., 2016, October. Detecting people in artwork with CNNs . In European Conference on Computer Vision (pp. 825-841). Springer, Cham

work page 2016
[6]

and Catanzaro, B., 2019

Zhu, Y., Sapra, K., Reda, F.A., Shih, K.J., Newsam, S., Tao, A. and Catanzaro, B., 2019. Improving Semantic Segmentation via Video Propagation and Label Relaxation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8856-8865)

work page 2019
[7]

and Cipolla, R., 2017

Badrinarayanan, V., Kendall, A. and Cipolla, R., 2017. Segnet: A deep convolutional encoder -decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12), pp.2481-2495

work page 2017
[8]

and Cipolla, R., 2015

Kendall, A., Grimes, M. and Cipolla, R., 2015. Posenet: A convolutional network for real -time 6 -dof camera relocalization. In Proceedings of the IEEE international conference on computer vision (pp. 2938 -2946). https://github.com/alexgkendall/caffe-posenet

work page 2015
[9]

and Kobbelt, L., 2012, October

Sattler, T., Leibe, B. and Kobbelt, L., 2012, October. Improving image -based localization by active correspondence search. In European conference on computer vision (pp. 752-765). Springer, Berlin, Heidelberg

work page 2012
[10]

and Kobbelt, L., 2016

Sattler, T ., Leibe, B. and Kobbelt, L., 2016. Efficient & effective prioritized matching for large -scale image -based localization. IEEE transactions on pattern analysis and machine intelligence, 39(9), pp.1744-1756

work page 2016
[11]

and Hu, X., 2017, May

Wu, J., Ma, L. and Hu, X., 2017, May. Delving deep er into convolutional neural networks for camera relocalization . In 2017 IEEE International Conference on Robotics and Automation (ICRA) (pp. 5644-5651). IEEE

work page 2017
[12]

and Cipolla, R., 2017

Kendall, A. and Cipolla, R., 2017. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5974-5983)

work page 2017
[13]

and Szeliski, R., 2006, July

Snavely, N., Seitz, S.M. and Szeliski, R., 2006, July. Photo tourism: exploring photo collections in 3D. In ACM transactions on graphics (TOG) (Vol. 25, No. 3, pp. 835 - 846). ACM

work page 2006
[14]

and Frahm, J.M., 2016

Schonberger, J.L. and Frahm, J.M., 2016. Structure -from- motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4104-4113)

work page 2016
[15]

and Frahm, J.M., 2016

Schonberger, J.L. and Frahm, J.M., 2016. Struc ture-from- motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4104-4113)

work page 2016
[16]

VisualSFM: A visual structure from motion system

Wu, C., 2011. VisualSFM: A visual structure from motion system. http://www. cs. washington. edu/homes/ccwu/vsfm

work page 2011
[17]

and Kahl, F., 2018

Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand, L., Stenborg, E., Safari, D., Okutomi, M., Pollefeys, M., Sivic, J. and Kahl, F., 2018. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8601-8610)

work page 2018
[18]

and Li, H., 2013

Hartley, R., Trumpf, J., Dai, Y. and Li, H., 2013. Rotation averaging. International journal of computer vision , 103(3), pp.267-305

work page 2013
[19]

Distinctive image features from scale - invariant keypoints

Lowe, D.G., 2004. Distinctive image features from scale - invariant keypoints. International journal of computer vision, 60(2), pp.91-110

work page 2004
[20]

and Rabinovich, A., 2018

DeTone, D., Malisiewicz, T. and Rabinovich, A., 2018. Superpoint: Self -supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 224-236)

work page 2018
[21]

and Bolles, R.C., 1981

Fischler, M.A. and Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), pp.381-395

work page 1981
[22]

and Sivic, J., 2016

Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T. and Sivic, J., 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5297-5307)

work page 2016
[23]

and Larlus, D., 2016, October

Gordo, A., Almazán, J., Revaud, J. and Larlus, D., 2016, October. Deep image retrieval: Learning global representations for image search. In European conference on computer vision (pp. 241-257). Springer, Cham

work page 2016
[24]

and Philbin, J., 2016, October

Weyand, T., Kostrikov, I. and Philbin, J., 2016, October. Planet-photo geolocation with convolutional neural networks. In European Conference on Computer Vision (pp. 37-55). Springer, Cham

work page 2016
[25]

and Dymczyk, M.,

Sarlin, P.E., Cadena, C., Siegwart, R. and Dymczyk, M.,

work page
[26]

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp

From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 12716 - 12725). https://github.com/ethz-asl/hfnet

work page
[27]

Understanding the Limitations of CNN-based Absolute Camera Pose Regression

Sattler, T., Zhou, Q., Pollefeys, M. and Leal-Taixe, L., 2019. Understanding the Limitations of CNN -based Absolute Camera Pose Regression. arXiv preprint arXiv:1903.07504

work page internal anchor Pith review Pith/arXiv arXiv 2019
[28]

and Moreno -Noguer, F., 2015

Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P. and Moreno -Noguer, F., 2015. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE Interna tional Conference on Computer Vision (pp. 118-126)

work page 2015
[29]

and Criminisi, A., 2013, October

Glocker, B., Izadi, S., Shotton, J. and Criminisi, A., 2013, October. Real -time RGB-D camera relocalization. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (pp. 173-179). IEEE

work page 2013
[30]

Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference

Gal, Y. and Ghahramani, Z., 2015. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158

work page internal anchor Pith review Pith/arXiv arXiv 2015
[31]

and Cipolla, R., 2016, May

Kendall, A. and Cipolla, R., 2016, May. Modelling uncertainty in deep learning for camera relocali zation. In 2016 IEEE international conference on Robotics and Automation (ICRA) (pp. 4762 -4769). IEEE. https://github.com/alexgkendall/caffe-posenet

work page 2016
[32]

and Cremers, D., 2017

Walch, F., Hazirbas, C., Leal -Taixe, L., Sattler, T., Hilsenbeck, S. and Cremers, D., 2017. Image -based localization using lstms for structured feature correlation. In Proceedings of the IEEE International Conference on Computer Vision (pp. 627-637)

work page 2017
[33]

and Ra htu, E., 2017

Melekhov, I., Ylioinas, J., Kannala, J. and Ra htu, E., 2017. Image-based localization using hourglass networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 879 -886). https://github.com/AaltoVision/camera-relocalisation

work page 2017
[34]

and Deng, J., 2016, October

Newell, A., Yang, K. and Deng, J., 2016, October. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (pp. 483 -499). Springer, Cham

work page 2016
[35]

and Sun, J., 2016

He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition . In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778)

work page 2016
[36]

and Burgard, W., 2017, September

Naseer, T. and Burgard, W., 2017, September. Deep regression for monocular camera -based 6 -dof global localization in outdoor envi ronments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1525-1530). IEEE

work page 2017
[37]

and Sun, J., 2015

Zhang, X., Zou, J., He, K. and Sun, J., 2015. Accelerating very deep convolutional networks for classification and detection. IEEE transacti ons on pattern analysis and machine intelligence, 38(10), pp.1943-1955

work page 2015
[38]

Kendall, A., Gal, Y., & Cipolla, R. (2018). Multi -task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7482-7491)

work page 2018
[39]

and Kautz, J., 2018

Brahmbhatt, S., Gu, J., Kim, K., Hays, J. and Kautz, J., 2018. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2616-2625)

work page 2018
[40]

and Cremers, D., 2017

Engel, J., Koltun, V. and Cremers, D., 2017. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence, 40(3), pp.611-625

work page 2017
[41]

and Cremers, D., 2013

Engel, J., Sturm, J. and Cremers, D., 2013. Semi-dense visual odometry for a monocular camera. In Proceedings of the IEEE international conference on computer vision (pp. 1449- 1456). https://github.com/NVlabs/geomapnet

work page 2013
[42]

Rotations, quaternions, and double groups

Altmann, S.L., 2005. Rotations, quaternions, and double groups. Courier Corporation

work page 2005
[43]

and Burgard, W., 2018, May

Valada, A., Radwan, N. and Burgard, W., 2018, May. Deep auxiliary learning for visual localization and odometry. In 2018 IEEE International Conference on R obotics and Automation (ICRA) (pp. 6939-6946). IEEE

work page 2018
[44]

and Hinton, G.E., 2010

Nair, V. and Hinton, G.E., 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 807-814)

work page 2010
[45]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Clevert, D.A., Unterthiner, T. and Hochreiter, S., 2015. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289

work page internal anchor Pith review Pith/arXiv arXiv 2015
[46]

and Burgard, W., 2018

Radwan, N., Valada, A. and Burgard, W., 2018. Vlocnet++: Deep multitask learning for semantic visual localization and odometry. IEEE Robotics and Automation Letters , 3(4), pp.4407-4414

work page 2018
[47]

Deep Global-Relative Networks for End-to-End 6-DoF Visual Localization and Odometry

Lin, Y., Liu, Z., Huang, J., Wang, C., Du, G., Bai, J., Lian, S. and Huang, B., 2018. Deep Global -Relative Networks for End-to-End 6-DoF Visual Localization and Odometry. arXiv preprint arXiv:1812.07869

work page internal anchor Pith review Pith/arXiv arXiv 2018
[48]

and Wen, H.,

Clark, R., Wang, S., Markham, A., Trigoni, N. and Wen, H.,

work page
[49]

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp

Vidloc: A deep spatio-temporal model for 6-dof video- clip relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp . 6856 - 6864)

work page
[50]

and Shammah, S., 2017, August

Shalev-Shwartz, S., Shamir, O. and Shammah, S., 2017, August. Failures of gradient -based deep learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 3067-3075). JMLR. org

work page 2017
[51]

and Mayol -Cuevas, W., 2018

Contreras, L. and Mayol -Cuevas, W., 2018. Towards CNN map representation and compression for camera relocalisation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 292-299)

work page 2018
[52]

and Kannala, J., 2017

Laskar, Z., Melekhov, I., Kalia, S. and Kannala, J., 2017. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision (pp. 929-938). https://github.com/AaltoVision/camera- relocalisation

work page 2017
[53]

and Prisacariu, V., 2018

Balntas, V., Li, S. and Prisacariu, V., 2018. Relocnet: Continuous metric learning relocalisation using neural nets. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 751-767)

work page 2018
[54]

and Rother, C., 2017

Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S. and Rother, C., 2017. DSAC -differentiable RANSAC for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6684 -6692). https://github.com/cvlab- dresden/DSAC

work page 2017
[55]

and Fitzgibbon, A., 2013

Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A. and Fitzgibbon, A., 2013. Scene coordinate regression forests for camera relocalization in RGB -D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2930-2937)

work page 2013
[56]

and Rother, C., 2018

Brachmann, E. and Rother, C., 2018. Learning less is more - 6d camera localization via 3d surface regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4654 -4662). https://github.com/vislearn/LessMore

work page 2018
[57]

CVPR 2019 workshop on Long -Term Visual Localization https://www.visuallocalization.net/

work page 2019
[58]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., We yand, T., Andreetto, M. and Adam, H., 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

work page internal anchor Pith review Pith/arXiv arXiv 2017
[59]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O. and Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2015
[60]

and Sinha, S.N.,

Pittaluga, F., Koppal, S.J., Bing Kang, S. and Sinha, S.N.,

work page
[61]

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp

Revealing scenes by inverting structure from motion reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 145-154)

work page
[62]

Style Augmentation: Data Augmentation via Style Randomization

Jackson, P.T., Atapour-Abarghouei, A., Bonner, S., Breckon, T. and Obara, B., 2018. Style Augmentation: Data Augmentation via Style Randomization. arXiv preprint arXiv:1809.05375

work page internal anchor Pith review Pith/arXiv arXiv 2018
[63]

Night-to-Day Image Translation for Retrieval-based Localization

Anoosheh, A., Sattler, T., Timofte, R., Pollefeys, M. and Van Gool, L., 2018. Night -to-Day Image Translation for Retrieval-based Localization. arXiv preprint arXiv:1809.09767

work page internal anchor Pith review Pith/arXiv arXiv 2018
[64]

and Ramisa, A., 2019

Yu, L., Oguz Yazici, V., Liu, X., van de Weijer, J., Cheng, Y. and Ramisa, A., 2019. Learning Metrics from Teachers: Compact Networks for Image Embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2907-2916)

work page 2019
[65]

and Le, Q.V., 2019

Kornblith, S., Shlens, J. and Le, Q.V., 2019. Do better imagenet models transfer better?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2661-2671)

work page 2019
[66]

Learning Loss for Active Learning

Yoo, D. and Kweon, I.S ., 201 9. Learning Loss for Active Learning. arXiv preprint arXiv: 1905.03677

work page internal anchor Pith review Pith/arXiv arXiv 1905

[1] [1]

and van der Maaten, L., 2018

Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A. and van der Maaten, L., 2018. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 181-196)

work page 2018

[2] [2]

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Tan, M. and Le, Q.V., 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946

work page internal anchor Pith review Pith/arXiv arXiv 2019

[3] [3]

and Rabinovich, A.,

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A.,

work page

[4] [4]

In Proceedings of the IEEE conference on computer vision and pattern recognition (pp

Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9)

work page

[5] [5]

and Hall, P., 2016, October

Westlake, N., Cai, H. and Hall, P., 2016, October. Detecting people in artwork with CNNs . In European Conference on Computer Vision (pp. 825-841). Springer, Cham

work page 2016

[6] [6]

and Catanzaro, B., 2019

Zhu, Y., Sapra, K., Reda, F.A., Shih, K.J., Newsam, S., Tao, A. and Catanzaro, B., 2019. Improving Semantic Segmentation via Video Propagation and Label Relaxation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8856-8865)

work page 2019

[7] [7]

and Cipolla, R., 2017

Badrinarayanan, V., Kendall, A. and Cipolla, R., 2017. Segnet: A deep convolutional encoder -decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12), pp.2481-2495

work page 2017

[8] [8]

and Cipolla, R., 2015

Kendall, A., Grimes, M. and Cipolla, R., 2015. Posenet: A convolutional network for real -time 6 -dof camera relocalization. In Proceedings of the IEEE international conference on computer vision (pp. 2938 -2946). https://github.com/alexgkendall/caffe-posenet

work page 2015

[9] [9]

and Kobbelt, L., 2012, October

Sattler, T., Leibe, B. and Kobbelt, L., 2012, October. Improving image -based localization by active correspondence search. In European conference on computer vision (pp. 752-765). Springer, Berlin, Heidelberg

work page 2012

[10] [10]

and Kobbelt, L., 2016

Sattler, T ., Leibe, B. and Kobbelt, L., 2016. Efficient & effective prioritized matching for large -scale image -based localization. IEEE transactions on pattern analysis and machine intelligence, 39(9), pp.1744-1756

work page 2016

[11] [11]

and Hu, X., 2017, May

Wu, J., Ma, L. and Hu, X., 2017, May. Delving deep er into convolutional neural networks for camera relocalization . In 2017 IEEE International Conference on Robotics and Automation (ICRA) (pp. 5644-5651). IEEE

work page 2017

[12] [12]

and Cipolla, R., 2017

Kendall, A. and Cipolla, R., 2017. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5974-5983)

work page 2017

[13] [13]

and Szeliski, R., 2006, July

Snavely, N., Seitz, S.M. and Szeliski, R., 2006, July. Photo tourism: exploring photo collections in 3D. In ACM transactions on graphics (TOG) (Vol. 25, No. 3, pp. 835 - 846). ACM

work page 2006

[14] [14]

and Frahm, J.M., 2016

Schonberger, J.L. and Frahm, J.M., 2016. Structure -from- motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4104-4113)

work page 2016

[15] Schonberger, J.L. and Frahm, J.M., 2016. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4104-4113)

[16] Wu, C., 2011. VisualSFM: A visual structure from motion system. http://www.cs.washington.edu/homes/ccwu/vsfm

[17] Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand, L., Stenborg, E., Safari, D., Okutomi, M., Pollefeys, M., Sivic, J. and Kahl, F., 2018. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8601-8610)

[18] Hartley, R., Trumpf, J., Dai, Y. and Li, H., 2013. Rotation averaging. International journal of computer vision, 103(3), pp.267-305

[19] Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), pp.91-110

[20] DeTone, D., Malisiewicz, T. and Rabinovich, A., 2018. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 224-236)

[21] Fischler, M.A. and Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), pp.381-395

[22] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T. and Sivic, J., 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5297-5307)

[23] Gordo, A., Almazán, J., Revaud, J. and Larlus, D., 2016, October. Deep image retrieval: Learning global representations for image search. In European conference on computer vision (pp. 241-257). Springer, Cham

[24] Weyand, T., Kostrikov, I. and Philbin, J., 2016, October. Planet-photo geolocation with convolutional neural networks. In European Conference on Computer Vision (pp. 37-55). Springer, Cham

[25] Sarlin, P.E., Cadena, C., Siegwart, R. and Dymczyk, M., 2019. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 12716-12725). https://github.com/ethz-asl/hfnet

[27] Sattler, T., Zhou, Q., Pollefeys, M. and Leal-Taixe, L., 2019. Understanding the Limitations of CNN-based Absolute Camera Pose Regression. arXiv preprint arXiv:1903.07504

[28] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P. and Moreno-Noguer, F., 2015. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision (pp. 118-126)

[29] Glocker, B., Izadi, S., Shotton, J. and Criminisi, A., 2013, October. Real-time RGB-D camera relocalization. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (pp. 173-179). IEEE

[30] Gal, Y. and Ghahramani, Z., 2015. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158

[31] Kendall, A. and Cipolla, R., 2016, May. Modelling uncertainty in deep learning for camera relocalization. In 2016 IEEE international conference on Robotics and Automation (ICRA) (pp. 4762-4769). IEEE. https://github.com/alexgkendall/caffe-posenet

[32] Walch, F., Hazirbas, C., Leal-Taixe, L., Sattler, T., Hilsenbeck, S. and Cremers, D., 2017. Image-based localization using lstms for structured feature correlation. In Proceedings of the IEEE International Conference on Computer Vision (pp. 627-637)

[33] Melekhov, I., Ylioinas, J., Kannala, J. and Rahtu, E., 2017. Image-based localization using hourglass networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 879-886). https://github.com/AaltoVision/camera-relocalisation

[34] Newell, A., Yang, K. and Deng, J., 2016, October. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (pp. 483-499). Springer, Cham

[35] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778)

[36] Naseer, T. and Burgard, W., 2017, September. Deep regression for monocular camera-based 6-dof global localization in outdoor environments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1525-1530). IEEE

[37] Zhang, X., Zou, J., He, K. and Sun, J., 2015. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10), pp.1943-1955

[38] Kendall, A., Gal, Y. and Cipolla, R., 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7482-7491)

[39] Brahmbhatt, S., Gu, J., Kim, K., Hays, J. and Kautz, J., 2018. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2616-2625)

[40] Engel, J., Koltun, V. and Cremers, D., 2017. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence, 40(3), pp.611-625

[41] Engel, J., Sturm, J. and Cremers, D., 2013. Semi-dense visual odometry for a monocular camera. In Proceedings of the IEEE international conference on computer vision (pp. 1449-1456). https://github.com/NVlabs/geomapnet

[42] Altmann, S.L., 2005. Rotations, quaternions, and double groups. Courier Corporation

[43] Valada, A., Radwan, N. and Burgard, W., 2018, May. Deep auxiliary learning for visual localization and odometry. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 6939-6946). IEEE

[44] Nair, V. and Hinton, G.E., 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 807-814)

[45] Clevert, D.A., Unterthiner, T. and Hochreiter, S., 2015. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289

[46] Radwan, N., Valada, A. and Burgard, W., 2018. Vlocnet++: Deep multitask learning for semantic visual localization and odometry. IEEE Robotics and Automation Letters, 3(4), pp.4407-4414

[47] Lin, Y., Liu, Z., Huang, J., Wang, C., Du, G., Bai, J., Lian, S. and Huang, B., 2018. Deep Global-Relative Networks for End-to-End 6-DoF Visual Localization and Odometry. arXiv preprint arXiv:1812.07869

[48] Clark, R., Wang, S., Markham, A., Trigoni, N. and Wen, H., 2017. Vidloc: A deep spatio-temporal model for 6-dof video-clip relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6856-6864)

[50] Shalev-Shwartz, S., Shamir, O. and Shammah, S., 2017, August. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 3067-3075). JMLR.org

[51] Contreras, L. and Mayol-Cuevas, W., 2018. Towards CNN map representation and compression for camera relocalisation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 292-299)

[52] Laskar, Z., Melekhov, I., Kalia, S. and Kannala, J., 2017. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision (pp. 929-938). https://github.com/AaltoVision/camera-relocalisation

[53] Balntas, V., Li, S. and Prisacariu, V., 2018. Relocnet: Continuous metric learning relocalisation using neural nets. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 751-767)

[54] Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S. and Rother, C., 2017. DSAC-differentiable RANSAC for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6684-6692). https://github.com/cvlab-dresden/DSAC

[55] Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A. and Fitzgibbon, A., 2013. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2930-2937)

[56] Brachmann, E. and Rother, C., 2018. Learning less is more - 6d camera localization via 3d surface regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4654-4662). https://github.com/vislearn/LessMore

[57] CVPR 2019 workshop on Long-Term Visual Localization. https://www.visuallocalization.net/

[58] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H., 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

[59] Hinton, G., Vinyals, O. and Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

[60] Pittaluga, F., Koppal, S.J., Bing Kang, S. and Sinha, S.N., 2019. Revealing scenes by inverting structure from motion reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 145-154)

[62] Jackson, P.T., Atapour-Abarghouei, A., Bonner, S., Breckon, T. and Obara, B., 2018. Style Augmentation: Data Augmentation via Style Randomization. arXiv preprint arXiv:1809.05375

[63] Anoosheh, A., Sattler, T., Timofte, R., Pollefeys, M. and Van Gool, L., 2018. Night-to-Day Image Translation for Retrieval-based Localization. arXiv preprint arXiv:1809.09767

[64] Yu, L., Oguz Yazici, V., Liu, X., van de Weijer, J., Cheng, Y. and Ramisa, A., 2019. Learning Metrics from Teachers: Compact Networks for Image Embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2907-2916)

[65] Kornblith, S., Shlens, J. and Le, Q.V., 2019. Do better imagenet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2661-2671)

[66] Yoo, D. and Kweon, I.S., 2019. Learning Loss for Active Learning. arXiv preprint arXiv:1905.03677