arxiv: 2305.15404 · v2 · submitted 2023-05-24 · 💻 cs.CV

RoMa: Robust Dense Feature Matching

Pith reviewed 2026-05-24 08:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords dense feature matching · robust matching · DINOv2 · transformer decoder · feature pyramid · computer vision · image correspondence · multimodal decoding

The pith

RoMa combines frozen DINOv2 features with ConvNet fine features and anchor-probability decoding to achieve robust dense feature matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes RoMa, a dense feature matching model designed to stay robust under challenging real-world changes between images. It starts from large-scale pretrained DINOv2 features, which are robust but coarse, adds specialized ConvNet features for precise localization, and decodes matches with a transformer that predicts anchor probabilities so it can represent several plausible matches at once. An improved loss, regression-by-classification followed by robust regression, supports training. The paper reports new state-of-the-art results, including a 36% improvement on the WxBS benchmark, which matters for any vision system that needs reliable point correspondences under lighting or viewpoint shifts.

Core claim

RoMa establishes robust dense correspondences by leveraging frozen DINOv2 features combined with specialized ConvNet fine features to form a precisely localizable feature pyramid, decoded via a tailored transformer that predicts anchor probabilities to express multimodality, and trained with regression-by-classification followed by robust regression.

What carries the argument

The feature pyramid of frozen DINOv2 features plus ConvNet fine features, together with the transformer match decoder that predicts anchor probabilities.
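
To make the role of anchor probabilities concrete, here is a minimal 1D sketch of why a distribution over anchors can express multimodality where a single regressed value cannot. It is an illustration only, not the authors' implementation: the anchor grid, the logits, and the helper names (`softmax`, `decode_mode`, `decode_mean`) are all hypothetical, and the paper works with 2D image coordinates rather than a 1D line.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_mode(anchors, probs):
    """Return the most probable anchor, which commits to a single mode."""
    best = max(range(len(anchors)), key=lambda k: probs[k])
    return anchors[best]

def decode_mean(anchors, probs):
    """Return the probability-weighted mean, which averages across modes."""
    return sum(a * p for a, p in zip(anchors, probs))

# Hypothetical grid: 9 anchors evenly spaced on [-1, 1].
anchors = [-1.0 + 0.25 * k for k in range(9)]

# Hypothetical logits with two plausible matches, near -0.5 and +0.75,
# as could happen at a motion boundary seen at coarse resolution.
logits = [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 3.5, 0.0]
probs = softmax(logits)

mode = decode_mode(anchors, probs)   # lands on one of the two candidates
mean = decode_mean(anchors, probs)   # lands between them, matching neither

print(f"mode decode: {mode:+.3f}")
print(f"mean decode: {mean:+.3f}")
```

In this toy setup the mode decode returns the anchor at -0.5, while the weighted mean falls close to 0, far from both candidate matches. That gap is the failure a unimodal regressor would exhibit on an ambiguous region.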

If this is right

  • Dense matching becomes more reliable under extreme appearance and geometric changes.
  • Downstream tasks such as 3D reconstruction gain from higher quality correspondences.
  • The model generalizes better across datasets without per-dataset retraining.
  • Multimodal predictions allow better handling of ambiguous regions in images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar fusion strategies could be tested with other large pretrained vision models.
  • The anchor probability approach might apply to other correspondence problems like optical flow.
  • Performance on WxBS suggests potential for real-world applications in outdoor or seasonal monitoring.

Load-bearing premise

The combination of frozen DINOv2 features, ConvNet fine features, anchor-probability decoding, and the regression loss will generalize to unseen real-world image distributions without dataset-specific adjustments.

What would settle it

Evaluation on a new set of image pairs with novel appearance changes, such as extreme weather or unseen object categories, where the accuracy does not exceed previous methods by a large margin.

Figures

Figures reproduced from arXiv: 2305.15404 by Georg Bökman, Johan Edstedt, Mårten Wadenbäck, Michael Felsberg, Qiyu Sun.

Figure 1: RoMa is robust, i.e., able to match under extreme changes. We propose RoMa, a model for dense feature matching that is robust to a wide variety of challenging real-world changes in scale, illumination, viewpoint, and texture. We show correspondences estimated by RoMa on the extremely challenging benchmark WxBS [35], where most previous methods fail, and on which we set a new state-of-the-art with an improv…
Figure 2: Illustration of our robust approach RoMa. Our contributions are shown with green highlighting and a checkmark, while previous approaches are indicated with gray highlights and a cross. Our first contribution is using a frozen foundation model for coarse features, compared to fine-tuning or training from scratch. DINOv2 lacks fine features, which are needed for accurate correspondences. To tackle this, we c…
Figure 3: Illustration of localizability of matches. At infinite resolution the match distribution can be seen as a 2D surface (illustrated as 1D lines in the figure), however at a coarser scale s this distribution becomes blurred due to motion boundaries. This means it is necessary to both use a model and an objective function capable of representing multimodal distributions. encodings. By restricting the model to…
Figure 4: Comparison of loss gradients. We use the generalized Charbonnier [3] loss for refinement, which locally matches L2 gradients, but globally decays with |x|^{-1/2} toward zero. is unimodal locally. However, if this initial choice is far outside the support of the distribution, using a non-robust loss function is problematic. It is therefore motivated to use a robust regression loss for this stage. Loss formu…
Figure 5: Evaluation of frozen features. From top to bottom: Image pair, VGG19 matches, RN50 matches, DINOv2 matches, RoMa matches. DINOv2 is significantly more robust than the VGG19 and RN50. Quantitative results are presented in …
Figure 6: Qualitative comparison. RoMa is significantly more robust to extreme changes in viewpoint and illumination than DKM.
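
The gradient behavior described in the Figure 4 caption can be checked numerically. The sketch below assumes the penalty has the form (x^2 + c^2)^alpha with alpha = 0.25, chosen because that exponent reproduces the caption's |x|^{-1/2} decay. The value c = 0.03 and the function names are illustrative assumptions, not taken from the page above.

```python
import math

def charbonnier(x, c=0.03, alpha=0.25):
    """Generalized Charbonnier penalty (x^2 + c^2)^alpha; c and alpha are assumed values."""
    return (x * x + c * c) ** alpha

def charbonnier_grad(x, c=0.03, alpha=0.25):
    """Analytic derivative of the penalty: 2 * alpha * x * (x^2 + c^2)^(alpha - 1)."""
    return 2.0 * alpha * x * (x * x + c * c) ** (alpha - 1.0)

# Near zero the gradient is linear in x, like a rescaled L2 gradient:
# doubling a tiny residual should roughly double the gradient.
small = 1e-4
ratio_small = charbonnier_grad(2 * small) / charbonnier_grad(small)

# Far from zero the gradient magnitude decays like |x|^(-1/2):
# quadrupling a large residual should roughly halve the gradient.
large = 100.0
ratio_large = charbonnier_grad(4 * large) / charbonnier_grad(large)

print(f"gradient ratio near zero (x -> 2x): {ratio_small:.4f}")
print(f"gradient ratio far out  (x -> 4x): {ratio_large:.4f}")
```

Under these assumptions the first ratio comes out close to 2 and the second close to 0.5. Small residuals are therefore treated like least squares, while large outliers contribute a shrinking gradient instead of dominating the update.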
Original abstract

Feature matching is an important computer vision task that involves estimating correspondences between two images of a 3D scene, and dense methods estimate all such correspondences. The aim is to learn a robust model, i.e., a model able to match under challenging real-world changes. In this work, we propose such a model, leveraging frozen pretrained features from the foundation model DINOv2. Although these features are significantly more robust than local features trained from scratch, they are inherently coarse. We therefore combine them with specialized ConvNet fine features, creating a precisely localizable feature pyramid. To further improve robustness, we propose a tailored transformer match decoder that predicts anchor probabilities, which enables it to express multimodality. Finally, we propose an improved loss formulation through regression-by-classification with subsequent robust regression. We conduct a comprehensive set of experiments that show that our method, RoMa, achieves significant gains, setting a new state-of-the-art. In particular, we achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is provided at https://github.com/Parskatt/RoMa
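
The regression-by-classification idea named in the abstract can be sketched as follows: instead of regressing a coordinate directly, the ground-truth match is assigned to its nearest anchor and the model is scored with a negative log-likelihood over anchors. This is a simplified 1D illustration under stated assumptions. The anchor grid, the target value, and the function names are hypothetical, and the subsequent robust regression stage is not shown.

```python
import math

def nearest_anchor(anchors, target):
    """Index of the anchor closest to the ground-truth coordinate (the class label)."""
    return min(range(len(anchors)), key=lambda k: abs(anchors[k] - target))

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def classification_loss(logits, anchors, target):
    """Negative log-likelihood of the anchor nearest to the target."""
    label = nearest_anchor(anchors, target)
    return -log_softmax(logits)[label]

# Hypothetical 1D anchor grid on [-1, 1] and a hypothetical ground-truth match.
anchors = [-1.0 + 0.25 * k for k in range(9)]
target = 0.3

# A prediction that concentrates mass on the anchor at 0.25 ...
confident = [0.0] * 9
confident[5] = 5.0
# ... versus a prediction that spreads mass evenly over all anchors.
uniform = [0.0] * 9

loss_confident = classification_loss(confident, anchors, target)
loss_uniform = classification_loss(uniform, anchors, target)

print(f"loss, confident prediction: {loss_confident:.4f}")
print(f"loss, uniform prediction:   {loss_uniform:.4f}")
```

The uniform prediction scores log(9), the maximum-entropy baseline for nine anchors, and the confident prediction scores much lower. Because the loss only asks for probability mass on the correct anchor, it does not penalize the model for also keeping some mass on other plausible anchors the way a squared-error loss on a single coordinate would.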

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RoMa for robust dense feature matching. It combines frozen DINOv2 features with specialized ConvNet fine features to create a localizable feature pyramid, employs a transformer decoder that predicts anchor probabilities to handle multimodality, and uses an improved regression-by-classification loss followed by robust regression. Comprehensive experiments are reported to show significant gains over prior methods, including a 36% improvement on the challenging WxBS benchmark, establishing a new state-of-the-art. Code is released at the provided GitHub link.

Significance. If the performance claims hold under scrutiny, the work demonstrates a practical way to leverage large-scale pretrained foundation models for improved robustness in dense matching without training from scratch. The open-source code is a clear strength that supports reproducibility and further research.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the central claim of a 36% improvement and new SOTA on WxBS is load-bearing, yet the abstract (and by extension the reported experiments) provides no detail on baseline implementations, error bars, statistical significance, or data splits. This directly affects assessment of whether the reported gains are reliable.
  2. [Method / Experiments] Method and Experiments: the robustness claim rests on the specific fusion of frozen DINOv2, added ConvNet features, anchor-probability decoding, and the loss producing stable performance without per-dataset retuning. No ablation or evaluation on fresh distributions outside the reported benchmarks is described to test this assumption.
minor comments (2)
  1. Figure captions and legends should explicitly state the metrics and baselines shown in all quantitative plots for immediate readability.
  2. Notation for the anchor-probability output and the subsequent regression step should be introduced with a single consistent equation reference rather than scattered descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's significance and reproducibility. We respond to each major comment below.

  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim of a 36% improvement and new SOTA on WxBS is load-bearing, yet the abstract (and by extension the reported experiments) provides no detail on baseline implementations, error bars, statistical significance, or data splits. This directly affects assessment of whether the reported gains are reliable.

    Authors: We agree the abstract is concise and omits these specifics. The experiments section reports results using official baseline implementations and standard dataset protocols for splits. Error bars and significance tests are uncommon in this literature, but gains are consistent across benchmarks. We will revise the abstract to briefly note the evaluation setup, baselines, and data splits, and add a short discussion of reliability in the experiments section. revision: yes

  2. Referee: [Method / Experiments] Method and Experiments: the robustness claim rests on the specific fusion of frozen DINOv2, added ConvNet features, anchor-probability decoding, and the loss producing stable performance without per-dataset retuning. No ablation or evaluation on fresh distributions outside the reported benchmarks is described to test this assumption.

    Authors: The robustness without per-dataset retuning is supported by consistent SOTA results across the diverse reported benchmarks, which include extreme variations. While evaluations on entirely new distributions are not included, the current benchmarks test the components under challenging conditions. We will add a discussion of this scope in the experiments section and expand component ablations where feasible. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on external benchmarks

The paper presents RoMa as an empirical construction: frozen DINOv2 features combined with added ConvNet fine features, a transformer decoder outputting anchor probabilities, and regression-by-classification loss. All performance claims (including the 36% WxBS gain) are reported from direct experiments on standard benchmarks. No equations, predictions, or first-principles derivations are given that reduce by construction to fitted parameters or self-citations; the central claims rest on measured generalization rather than definitional equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed hybrid feature pyramid and decoder; the main external dependency is the robustness property of the frozen DINOv2 backbone taken from prior work.

free parameters (1)
  • training hyperparameters and loss weights
    Standard ML training choices that affect final performance but are not enumerated in the abstract.
axioms (1)
  • domain assumption: DINOv2 features remain significantly more robust than local features trained from scratch under real-world changes
    Invoked in the abstract as the starting point for the method.

pith-pipeline@v0.9.0 · 5736 in / 1349 out tokens · 31304 ms · 2026-05-24T08:54:10.678600+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency

    cs.CV 2026-04 unverdicted novelty 4.0

    A semi-dense image matching pipeline adds scale adaptability via score-matrix hints at the coarse stage and local flow consistency via gradient loss at the fine stage.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    HPatches: A benchmark and evaluation of handcrafted and learned local descriptors

    Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5173–5182, 2017.

  2. [2]

    MAGSAC++, a fast, reliable and accurate robust estimator

    Daniel Barath, Jana Noskova, Maksym Ivashechkin, and Jiri Matas. MAGSAC++, a fast, reliable and accurate robust estimator. In Conference on Computer Vision and Pattern Recognition, 2020.

  3. [3]

    A general and adaptive robust loss function

    Jonathan T Barron. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4331–4339,

  4. [4]

    Surf: Speeded up robust features

    Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006.

  5. [5]

    The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields

    Michael J Black and Paul Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer vision and image understanding, 63(1):75–104, 1996.

  6. [6]

    On the unification of line processes, outlier rejection, and robust statistics with applications in early vision

    Michael J Black and Anand Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International journal of computer vision, 19(1):57–91, 1996.

  7. [7]

    A case for using rotation invariant features in state of the art feature matchers

    Georg Bökman and Fredrik Kahl. A case for using rotation invariant features in state of the art feature matchers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5110–5119, 2022.

  8. [8]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

  9. [9]

    Large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression

    Ignas Budvytis, Marvin Teichmann, Tomas Vojir, and Roberto Cipolla. Large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression. In Proceedings of the British Machine Vision Conference (BMVC), pages 86.1–86.13. BMVA Press,

  10. [10]

    Improving transformer-based image matching by cascaded capturing spatially informative keypoints

    Chenjie Cao and Yanwei Fu. Improving transformer-based image matching by cascaded capturing spatially informative keypoints. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12129–12139, 2023.

  11. [11]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.

  12. [12]

    ASpanFormer: Detector-free image matching with adaptive span transformer

    Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David Mckinnon, Yanghai Tsin, and Long Quan. ASpanFormer: Detector-free image matching with adaptive span transformer. In Proc. European Conference on Computer Vision (ECCV), 2022.

  13. [13]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.

  14. [14]

    Superpoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 224–236, 2018.

  15. [15]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minnea...

  16. [16]

    D2-Net: A Trainable CNN for Joint Detection and Description of Local Features

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

  17. [17]

    DKM: Dense kernelized feature matching for geometry estimation

    Johan Edstedt, Ioannis Athanasiadis, Mårten Wadenbäck, and Michael Felsberg. DKM: Dense kernelized feature matching for geometry estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.

  18. [18]

    Channel smoothing: Efficient robust smoothing of low-level signal features

    Michael Felsberg, P-E Forssen, and H Scharr. Channel smoothing: Efficient robust smoothing of low-level signal features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(2):209–222, 2006.

  19. [19]

    Wasserstein distances for stereo disparity estimation

    Divyansh Garg, Yan Wang, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei-Lun Chao. Wasserstein distances for stereo disparity estimation. Advances in Neural Information Processing Systems, 33:22517–22529, 2020.

  20. [20]

    Neural reprojection error: Merging feature learning and camera pose estimation

    Hugo Germain, Vincent Lepetit, and Guillaume Bourmaud. Neural reprojection error: Merging feature learning and camera pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 414–423, 2021.

  21. [21]

    SiLK: Simple Learned Keypoints

    Pierre Gleize, Weiyao Wang, and Matt Feiszli. SiLK: Simple Learned Keypoints. In ICCV, 2023.

  22. [22]

    Predicting disparity distributions

    Gustav Häger, Mikael Persson, and Michael Felsberg. Predicting disparity distributions. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4363–4369. IEEE, 2021.

  23. [23]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

  24. [24]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.

  25. [25]

    Image matching challenge 2022

    Addison Howard, Eduard Trulls, Kwang Moo Yi, Dmitry Mishkin, Sohier Dane, and Yuhe Jin. Image matching challenge 2022, 2022.

  26. [26]

    The structure of images

    Jan J Koenderink. The structure of images. Biological cybernetics, 50(5):363–370, 1984.

  27. [27]

    Hierarchical scene coordinate classification and regression for visual localization

    Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11983–11992, 2020.

  28. [28]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.

  29. [29]

    Could giant pre-trained image models extract universal representations?

    Yutong Lin, Ze Liu, Zheng Zhang, Han Hu, Nanning Zheng, Stephen Lin, and Yue Cao. Could giant pre-trained image models extract universal representations? Advances in Neural Information Processing Systems, 35:8332–8346, 2022.

  30. [30]

    Scale-space theory: A basic tool for analyzing structures at different scales

    Tony Lindeberg. Scale-space theory: A basic tool for analyzing structures at different scales. Journal of applied statistics, 21(1-2):225–270, 1994.

  31. [31]

    LightGlue: Local Feature Matching at Light Speed

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In ICCV, 2023.

  32. [32]

    Camliflow: bidirectional camera-lidar fusion for joint optical flow and scene flow estimation

    Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, Wenjie Li, and Lijun Chen. Camliflow: bidirectional camera-lidar fusion for joint optical flow and scene flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5791–5801, 2022.

  33. [33]

    Distinctive image features from scale-invariant keypoints

    David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.

  34. [34]

    Dgc-net: Dense geometric correspondence network

    Iaroslav Melekhov, Aleksei Tiulpin, Torsten Sattler, Marc Pollefeys, Esa Rahtu, and Juho Kannala. Dgc-net: Dense geometric correspondence network. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1034–1042. IEEE, 2019.

  35. [35]

    WxBS: Wide Baseline Stereo Generalizations

    Dmytro Mishkin, Jiri Matas, Michal Perdoch, and Karel Lenc. WxBS: Wide Baseline Stereo Generalizations. In Proceedings of the British Machine Vision Conference. BMVA,

  36. [36]

    Pats: Patch area transportation with subdivision for local feature matching

    Junjie Ni, Yijin Li, Zhaoyang Huang, Hongsheng Li, Hujun Bao, Zhaopeng Cui, and Guofeng Zhang. Pats: Patch area transportation with subdivision for local feature matching. In The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.

  37. [37]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...

  38. [38]

    Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

  39. [39]

    R2d2: Reliable and repeatable detector and descriptor

    Jerome Revaud, Cesar De Souza, Martin Humenberger, and Philippe Weinzaepfel. R2d2: Reliable and repeatable detector and descriptor. Advances in neural information processing systems, 32:12405–12415, 2019.

  40. [40]

    From coarse to fine: Robust hierarchical localization at large scale

    Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.

  41. [41]

    Superglue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020.

  42. [42]

    Back to the feature: Learning robust camera localization from pixels to pose

    Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, et al. Back to the feature: Learning robust camera localization from pixels to pose. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3247–3257, 2021.

  43. [43]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.

  44. [44]

    LoFTR: Detector-free local feature matching with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8922–8931, 2021.

  45. [45]

    Inloc: Indoor visual localization with dense matching and view synthesis

    Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. Inloc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7199–7209, 2018.

  46. [46]

    Quadtree attention for vision transformers

    Shitao Tang, Jiahui Zhang, Siyu Zhu, and Ping Tan. Quadtree attention for vision transformers. In International Conference on Learning Representations, 2022.

  47. [47]

    Prior guided feature enrichment network for few-shot segmentation

    Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation. IEEE transactions on pattern analysis and machine intelligence, 44(2):1050–1065, 2020.

  48. [48]

    Regression by classification

    Luís Torgo and João Gama. Regression by classification. In Advances in Artificial Intelligence, pages 51–60, Berlin, Heidelberg, 1996. Springer Berlin Heidelberg.

  49. [49]

    GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network

    Prune Truong, Martin Danelljan, Luc V Gool, and Radu Timofte. GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network. Advances in Neural Information Processing Systems, 33, 2020.

  50. [50]

    GLU-Net: Global-local universal network for dense flow and correspondences

    Prune Truong, Martin Danelljan, and Radu Timofte. GLU-Net: Global-local universal network for dense flow and correspondences. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6258–6268, 2020.

  51. [51]

    Learning accurate dense correspondences and when to trust them

    Prune Truong, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning accurate dense correspondences and when to trust them. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5714–5724, 2021.

  52. [52]

    PDC-Net+: Enhanced Probabilistic Dense Correspondence Network

    Prune Truong, Martin Danelljan, Radu Timofte, and Luc Van Gool. PDC-Net+: Enhanced Probabilistic Dense Correspondence Network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

  53. [53]

    DISK: learning local features with policy gradient

    Michal J. Tyszkiewicz, Pascal Fua, and Eduard Trulls. DISK: learning local features with policy gradient. In NeurIPS,

  54. [54]

    Proper reuse of image classification features improves object detection

    Cristina Vasconcelos, Vighnesh Birodkar, and Vincent Dumoulin. Proper reuse of image classification features improves object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13628–13637, 2022.

  55. [55]

    MatchFormer: Interleaving attention in transformers for feature matching

    Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, and Rainer Stiefelhagen. MatchFormer: Interleaving attention in transformers for feature matching. In Asian Conference on Computer Vision, 2022.

  56. [56]

    Masked feature prediction for self-supervised visual pre-training

    Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14668–14678, 2022.

  57. [57]

    Rule-based regression

    Sholom M. Weiss and Nitin Indurkhya. Rule-based regression. In Proceedings of the 13th International Joint Conference on Artificial Intelligence. Chambéry, France, August 28 - September 3, 1993, pages 1072–1078. Morgan Kaufmann,

  58. [58]

    Rule-based machine learning methods for functional prediction

    Sholom M. Weiss and Nitin Indurkhya. Rule-based machine learning methods for functional prediction. J. Artif. Intell. Res., 3:383–403, 1995.

  59. [59]

    Andrew P. Witkin. Scale space filtering. Proc. 8th International Joint on Artificial Intelligence, pages 1091–1022,

  60. [60]

    Revealing the dark secrets of masked image modeling

    Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, and Yue Cao. Revealing the dark secrets of masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14475–14485, 2023.

  61. [61]

    ASTR: Adaptive spot-guided transformer for consistent local feature matching

    Jiahuan Yu, Jiahao Chang, Jianfeng He, Tianzhu Zhang, Jiyang Yu, and Wu Feng. ASTR: Adaptive spot-guided transformer for consistent local feature matching. In The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.

  62. [62]

    ibot: Image bert pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. In International Conference on Learning Representations, 2022.

  63. [63]

    PMatch: Paired masked image modeling for dense geometric matching

    Shengjie Zhu and Xiaoming Liu. PMatch: Paired masked image modeling for dense geometric matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.