RoMa: Robust Dense Feature Matching

Georg Bökman; Johan Edstedt; Mårten Wadenbäck; Michael Felsberg; Qiyu Sun

arxiv: 2305.15404 · v2 · submitted 2023-05-24 · 💻 cs.CV

Pith reviewed 2026-05-24 08:54 UTC · model grok-4.3

keywords: dense feature matching · robust matching · DINOv2 · transformer decoder · feature pyramid · computer vision · image correspondence · multimodal decoding

The pith

RoMa combines frozen DINOv2 features with ConvNet fine features and anchor-probability decoding to achieve robust dense feature matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes RoMa, a dense feature matching model designed to stay robust under challenging real-world changes between images. It starts from large-scale pretrained DINOv2 features, which are robust but coarse, adds specialized ConvNet features for precise localization, and decodes matches with a transformer that predicts anchor probabilities so it can represent several plausible matches at once. An improved loss formulation supports training. The abstract reports state-of-the-art results, including a 36% improvement on the WxBS benchmark, which matters for any vision system that needs reliable point correspondences under lighting or viewpoint shifts.

Core claim

RoMa establishes robust dense correspondences by leveraging frozen DINOv2 features combined with specialized ConvNet fine features to form a precisely localizable feature pyramid, decoded via a tailored transformer that predicts anchor probabilities to express multimodality, and trained with regression-by-classification followed by robust regression.

What carries the argument

The feature pyramid of frozen DINOv2 features plus ConvNet fine features, together with the transformer match decoder that predicts anchor probabilities.
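
To make the anchor-probability idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a hypothetical 8×8 grid of anchors in normalized coordinates and hand-set logits for one ambiguous pixel with two candidate matches; `anchor_grid`, `decode_mode`, and `decode_mean` are illustrative names. It shows why a classification over anchors can express two modes, while a single regressed mean lands between them.

```python
import numpy as np

def anchor_grid(k):
    """Centers of a k x k anchor grid in normalized coordinates [-1, 1] (assumed layout)."""
    centers = (np.arange(k) + 0.5) / k * 2.0 - 1.0
    xs, ys = np.meshgrid(centers, centers)              # each of shape (k, k)
    return np.stack([xs.ravel(), ys.ravel()], axis=1)   # (k*k, 2) in (x, y) order

def softmax(logits):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def decode_mode(logits, anchors):
    """Regression-by-classification: return the most probable anchor and all probabilities."""
    probs = softmax(logits)
    return anchors[np.argmax(probs)], probs

def decode_mean(probs, anchors):
    """Unimodal alternative: the probability-weighted mean coordinate."""
    return probs @ anchors

# Toy ambiguous pixel: two plausible matches on opposite sides of the image.
k = 8
anchors = anchor_grid(k)
mode_a = np.array([-0.875, -0.125])   # hypothetical candidate match 1
mode_b = np.array([0.875, -0.125])    # hypothetical candidate match 2
logits = np.full(k * k, -10.0)
logits[np.argmin(np.linalg.norm(anchors - mode_a, axis=1))] = 2.0
logits[np.argmin(np.linalg.norm(anchors - mode_b, axis=1))] = 1.5

mode_xy, probs = decode_mode(logits, anchors)
mean_xy = decode_mean(probs, anchors)
print("mode estimate:", mode_xy)   # coincides with candidate match 1
print("mean estimate:", mean_xy)   # falls between the two candidates
```

In this toy setting the mode estimate picks one of the two candidates, whereas the mean estimate is far from both. That is the failure a multimodal output is meant to avoid near motion boundaries.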

If this is right

Dense matching becomes more reliable under extreme appearance and geometric changes.
Downstream tasks such as 3D reconstruction gain from higher quality correspondences.
The model generalizes better across datasets without per-dataset retraining.
Multimodal predictions allow better handling of ambiguous regions in images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar fusion strategies could be tested with other large pretrained vision models.
The anchor probability approach might apply to other correspondence problems like optical flow.
Performance on WxBS suggests potential for real-world applications in outdoor or seasonal monitoring.

Load-bearing premise

The combination of frozen DINOv2 features, ConvNet fine features, anchor-probability decoding, and the regression loss will generalize to unseen real-world image distributions without dataset-specific adjustments.

What would settle it

An evaluation on new image pairs with novel appearance changes, such as extreme weather or unseen object categories, in which RoMa's accuracy fails to exceed previous methods by a large margin.

Figures

Figures reproduced from arXiv: 2305.15404 by Georg Bökman, Johan Edstedt, Mårten Wadenbäck, Michael Felsberg, Qiyu Sun.

**Figure 1.** RoMa is robust, i.e., able to match under extreme changes. We propose RoMa, a model for dense feature matching that is robust to a wide variety of challenging real-world changes in scale, illumination, viewpoint, and texture. We show correspondences estimated by RoMa on the extremely challenging benchmark WxBS [35], where most previous methods fail, and on which we set a new state-of-the-art with an improv…

**Figure 2.** Illustration of our robust approach RoMa. Our contributions are shown with green highlighting and a checkmark, while previous approaches are indicated with gray highlights and a cross. Our first contribution is using a frozen foundation model for coarse features, compared to fine-tuning or training from scratch. DINOv2 lacks fine features, which are needed for accurate correspondences. To tackle this, we c…

**Figure 3.** Illustration of localizability of matches. At infinite resolution the match distribution can be seen as a 2D surface (illustrated as 1D lines in the figure), however at a coarser scale s this distribution becomes blurred due to motion boundaries. This means it is necessary to both use a model and an objective function capable of representing multimodal distributions.

**Figure 4.** Comparison of loss gradients. We use the generalized Charbonnier [3] loss for refinement, which locally matches L2 gradients, but globally decays with $|x|^{-1/2}$ toward zero. is unimodal locally. However, if this initial choice is far outside the support of the distribution, using a non-robust loss function is problematic. It is therefore motivated to use a robust regression loss for this stage. Loss formu…

**Figure 5.** Evaluation of frozen features. From top to bottom: Image pair, VGG19 matches, RN50 matches, DINOv2 matches, RoMa matches. DINOv2 is significantly more robust than the VGG19 and RN50. Quantitative results are presented in…

**Figure 6.** Qualitative comparison. RoMa is significantly more robust to extreme changes in viewpoint and illumination than DKM.
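
The gradient behaviour described in the Figure 4 caption can be checked numerically. The sketch below assumes the form rho(x) = (x^2 + c^2)^alpha with alpha = 1/4, chosen only because that exponent yields the |x|^(-1/2) gradient decay the caption states. The scale c = 1 and all function names are hypothetical, and the paper's exact parametrization may differ.

```python
import numpy as np

def charbonnier(x, c=1.0, alpha=0.25):
    """Generalized Charbonnier penalty (x^2 + c^2)^alpha (assumed form)."""
    return (x**2 + c**2) ** alpha

def charbonnier_grad(x, c=1.0, alpha=0.25):
    """Analytic derivative of the penalty above with respect to x."""
    return 2.0 * alpha * x * (x**2 + c**2) ** (alpha - 1.0)

def matched_l2_grad(x, c=1.0, alpha=0.25):
    """Gradient of the quadratic that shares the Charbonnier slope at x = 0."""
    return 2.0 * alpha * c ** (2.0 * alpha - 2.0) * x

small = np.array([1e-3, 1e-2])        # residuals near zero
large = np.array([1e2, 1e3, 1e4])     # residuals far from zero

# Near zero the two gradients agree, so this ratio is close to 1.
ratio_small = charbonnier_grad(small) / matched_l2_grad(small)

# Far from zero, grad * sqrt(|x|) tends to the constant 2 * alpha = 0.5,
# which means the gradient itself decays like |x|^(-1/2).
scaled_large = charbonnier_grad(large) * np.sqrt(large)

print(ratio_small)
print(scaled_large)
```

Under these assumptions, small residuals receive an L2-like gradient while large residuals, which are likely outliers, are progressively down-weighted rather than dominating the update.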

Original abstract

Feature matching is an important computer vision task that involves estimating correspondences between two images of a 3D scene, and dense methods estimate all such correspondences. The aim is to learn a robust model, i.e., a model able to match under challenging real-world changes. In this work, we propose such a model, leveraging frozen pretrained features from the foundation model DINOv2. Although these features are significantly more robust than local features trained from scratch, they are inherently coarse. We therefore combine them with specialized ConvNet fine features, creating a precisely localizable feature pyramid. To further improve robustness, we propose a tailored transformer match decoder that predicts anchor probabilities, which enables it to express multimodality. Finally, we propose an improved loss formulation through regression-by-classification with subsequent robust regression. We conduct a comprehensive set of experiments that show that our method, RoMa, achieves significant gains, setting a new state-of-the-art. In particular, we achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is provided at https://github.com/Parskatt/RoMa

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RoMa fuses frozen DINOv2 with a ConvNet pyramid and adds anchor-probability decoding plus regression-by-classification loss, delivering a reported 36% gain on WxBS.

RoMa's core idea is to take the coarse but robust features from DINOv2, add a ConvNet pyramid for finer localization, feed them into a transformer decoder that outputs anchor probabilities to capture multimodality, and train with a regression-by-classification loss followed by robust regression. The abstract claims this produces a new state of the art, with a 36% relative improvement on the hard WxBS benchmark, and the authors release code at the GitHub link.

That combination of pieces looks like the actual novelty; prior work has used DINO features or transformers for matching, but the specific integration and the anchor-probability output do not appear directly in the cited literature. The experiments are described as comprehensive, which is useful for a methods paper in this area. Code release is a clear positive for reproducibility.

The main soft spot is that the abstract gives no detail on baseline re-implementations, error bars, or whether any post-hoc tuning occurred on the target benchmarks. The stress-test concern about generalization is fair: the paper does not show results on fresh distributions outside the reported test sets, so it remains open whether the gains hold without retuning. The central performance claim therefore rests on the strength of the full experimental section rather than on any formal guarantee.

This paper is aimed at researchers who build or use dense feature matchers for tasks like wide-baseline stereo or 3D reconstruction. A reader working in that subfield would get value from the architecture details and the benchmark numbers. It is coherent on its own terms and shows honest engagement with the literature, so it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes RoMa for robust dense feature matching. It combines frozen DINOv2 features with specialized ConvNet fine features to create a localizable feature pyramid, employs a transformer decoder that predicts anchor probabilities to handle multimodality, and uses an improved regression-by-classification loss followed by robust regression. Comprehensive experiments are reported to show significant gains over prior methods, including a 36% improvement on the challenging WxBS benchmark, establishing a new state-of-the-art. Code is released at the provided GitHub link.

Significance. If the performance claims hold under scrutiny, the work demonstrates a practical way to leverage large-scale pretrained foundation models for improved robustness in dense matching without training from scratch. The open-source code is a clear strength that supports reproducibility and further research.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the central claim of a 36% improvement and new SOTA on WxBS is load-bearing, yet the abstract (and by extension the reported experiments) provides no detail on baseline implementations, error bars, statistical significance, or data splits. This directly affects assessment of whether the reported gains are reliable.
[Method / Experiments] Method and Experiments: the robustness claim rests on the specific fusion of frozen DINOv2, added ConvNet features, anchor-probability decoding, and the loss producing stable performance without per-dataset retuning. No ablation or evaluation on fresh distributions outside the reported benchmarks is described to test this assumption.

minor comments (2)

Figure captions and legends should explicitly state the metrics and baselines shown in all quantitative plots for immediate readability.
Notation for the anchor-probability output and the subsequent regression step should be introduced with a single consistent equation reference rather than scattered descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's significance and reproducibility. We respond to each major comment below.

Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim of a 36% improvement and new SOTA on WxBS is load-bearing, yet the abstract (and by extension the reported experiments) provides no detail on baseline implementations, error bars, statistical significance, or data splits. This directly affects assessment of whether the reported gains are reliable.

Authors: We agree the abstract is concise and omits these specifics. The experiments section reports results using official baseline implementations and standard dataset protocols for splits. Error bars and significance tests are uncommon in this literature, but gains are consistent across benchmarks. We will revise the abstract to briefly note the evaluation setup, baselines, and data splits, and add a short discussion of reliability in the experiments section.

revision: yes

Referee: [Method / Experiments] Method and Experiments: the robustness claim rests on the specific fusion of frozen DINOv2, added ConvNet features, anchor-probability decoding, and the loss producing stable performance without per-dataset retuning. No ablation or evaluation on fresh distributions outside the reported benchmarks is described to test this assumption.

Authors: The robustness without per-dataset retuning is supported by consistent SOTA results across the diverse reported benchmarks, which include extreme variations. While evaluations on entirely new distributions are not included, the current benchmarks test the components under challenging conditions. We will add a discussion of this scope in the experiments section and expand component ablations where feasible.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on external benchmarks

The paper presents RoMa as an empirical construction: frozen DINOv2 features combined with added ConvNet fine features, a transformer decoder outputting anchor probabilities, and regression-by-classification loss. All performance claims (including the 36% WxBS gain) are reported from direct experiments on standard benchmarks. No equations, predictions, or first-principles derivations are given that reduce by construction to fitted parameters or self-citations; the central claims rest on measured generalization rather than definitional equivalence.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed hybrid feature pyramid and decoder; the main external dependency is the robustness property of the frozen DINOv2 backbone taken from prior work.

free parameters (1)

training hyperparameters and loss weights
Standard ML training choices that affect final performance but are not enumerated in the abstract.

axioms (1)

domain assumption DINOv2 features remain significantly more robust than local features trained from scratch under real-world changes
Invoked in the abstract as the starting point for the method.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We propose a tailored transformer match decoder that predicts anchor probabilities... regression-by-classification loss for coarse global matches, while we use robust regression loss for the refinement stage
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

leveraging frozen pretrained features from the foundation model DINOv2... creating a precisely localizable feature pyramid

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency
cs.CV 2026-04 unverdicted novelty 4.0

A semi-dense image matching pipeline adds scale adaptability via score-matrix hints at the coarse stage and local flow consistency via gradient loss at the fine stage.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5173–5182, 2017.
[2] Daniel Barath, Jana Noskova, Maksym Ivashechkin, and Jiri Matas. MAGSAC++, a fast, reliable and accurate robust estimator. In Conference on Computer Vision and Pattern Recognition, 2020.
[3] Jonathan T Barron. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4331–4339,
[4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006.
[5] Michael J Black and Paul Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer vision and image understanding, 63(1):75–104, 1996.
[6] Michael J Black and Anand Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International journal of computer vision, 19(1):57–91, 1996.
[7] Georg Bökman and Fredrik Kahl. A case for using rotation invariant features in state of the art feature matchers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5110–5119, 2022.
[8] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[9] Ignas Budvytis, Marvin Teichmann, Tomas Vojir, and Roberto Cipolla. Large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression. In Proceedings of the British Machine Vision Conference (BMVC), pages 86.1–86.13. BMVA Press,
[10] Chenjie Cao and Yanwei Fu. Improving transformer-based image matching by cascaded capturing spatially informative keypoints. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12129–12139, 2023.
[11] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[12] Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David Mckinnon, Yanghai Tsin, and Long Quan. ASpanFormer: Detector-free image matching with adaptive span transformer. In Proc. European Conference on Computer Vision (ECCV), 2022.
[13] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
[14] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 224–236, 2018.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minnea...
[16] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[17] Johan Edstedt, Ioannis Athanasiadis, Mårten Wadenbäck, and Michael Felsberg. DKM: Dense kernelized feature matching for geometry estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
[18] Michael Felsberg, P-E Forssen, and H Scharr. Channel smoothing: Efficient robust smoothing of low-level signal features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(2):209–222, 2006.
[19] Divyansh Garg, Yan Wang, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei-Lun Chao. Wasserstein distances for stereo disparity estimation. Advances in Neural Information Processing Systems, 33:22517–22529, 2020.
[20] Hugo Germain, Vincent Lepetit, and Guillaume Bourmaud. Neural reprojection error: Merging feature learning and camera pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 414–423, 2021.
[21] Pierre Gleize, Weiyao Wang, and Matt Feiszli. SiLK: Simple Learned Keypoints. In ICCV, 2023.
[22] Gustav Häger, Mikael Persson, and Michael Felsberg. Predicting disparity distributions. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4363–4369. IEEE, 2021.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[24] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.
[25] Addison Howard, Eduard Trulls, Kwang Moo Yi, Dmitry Mishkin, Sohier Dane, and Yuhe Jin. Image matching challenge 2022, 2022.
[26] Jan J Koenderink. The structure of images. Biological cybernetics, 50(5):363–370, 1984.
[27] Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11983–11992, 2020.
[28] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[29] Yutong Lin, Ze Liu, Zheng Zhang, Han Hu, Nanning Zheng, Stephen Lin, and Yue Cao. Could giant pre-trained image models extract universal representations? Advances in Neural Information Processing Systems, 35:8332–8346, 2022.
[30] Tony Lindeberg. Scale-space theory: A basic tool for analyzing structures at different scales. Journal of applied statistics, 21(1-2):225–270, 1994.
[31] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In ICCV, 2023.
[32] Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, Wenjie Li, and Lijun Chen. Camliflow: bidirectional camera-lidar fusion for joint optical flow and scene flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5791–5801, 2022.
[33] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
[34] Iaroslav Melekhov, Aleksei Tiulpin, Torsten Sattler, Marc Pollefeys, Esa Rahtu, and Juho Kannala. Dgc-net: Dense geometric correspondence network. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1034–1042. IEEE, 2019.
[35] Dmytro Mishkin, Jiri Matas, Michal Perdoch, and Karel Lenc. WxBS: Wide Baseline Stereo Generalizations. In Proceedings of the British Machine Vision Conference. BMVA,
[36] Junjie Ni, Yijin Li, Zhaoyang Huang, Hongsheng Li, Hujun Bao, Zhaopeng Cui, and Guofeng Zhang. Pats: Patch area transportation with subdivision for local feature matching. In The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
[37] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...
[38] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[39] Jerome Revaud, Cesar De Souza, Martin Humenberger, and Philippe Weinzaepfel. R2d2: Reliable and repeatable detector and descriptor. Advances in neural information processing systems, 32:12405–12415, 2019.
[40] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.
[41] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020.
[42] Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, et al. Back to the feature: Learning robust camera localization from pixels to pose. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3247–3257, 2021.
[43] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
[44] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8922–8931, 2021.
[45] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. Inloc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7199–7209, 2018.
[46] Shitao Tang, Jiahui Zhang, Siyu Zhu, and Ping Tan. Quadtree attention for vision transformers. In International Conference on Learning Representations, 2022.
[47] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation. IEEE transactions on pattern analysis and machine intelligence, 44(2):1050–1065, 2020.
[48] Luís Torgo and João Gama. Regression by classification. In Advances in Artificial Intelligence, pages 51–60, Berlin, Heidelberg, 1996. Springer Berlin Heidelberg.
[49] Prune Truong, Martin Danelljan, Luc Van Gool, and Radu Timofte. GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network. Advances in Neural Information Processing Systems, 33, 2020. 1
[50] Prune Truong, Martin Danelljan, and Radu Timofte. GLU-Net: Global-local universal network for dense flow and correspondences. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6258–6268, 2020. 3
[51] Prune Truong, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning accurate dense correspondences and when to trust them. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5714–5724, 2021. 3
[52] Prune Truong, Martin Danelljan, Radu Timofte, and Luc Van Gool. PDC-Net+: Enhanced Probabilistic Dense Correspondence Network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 1, 4, 6, 7, 8
[53] Michal J. Tyszkiewicz, Pascal Fua, and Eduard Trulls. DISK: learning local features with policy gradient. In NeurIPS,
[54] Cristina Vasconcelos, Vighnesh Birodkar, and Vincent Dumoulin. Proper reuse of image classification features improves object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13628–13637, 2022. 1
[55] Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, and Rainer Stiefelhagen. MatchFormer: Interleaving attention in transformers for feature matching. In Asian Conference on Computer Vision, 2022. 7
[56] Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14668–14678, 2022. 1
[57] Sholom M. Weiss and Nitin Indurkhya. Rule-based regression. In Proceedings of the 13th International Joint Conference on Artificial Intelligence. Chambéry, France, August 28 - September 3, 1993, pages 1072–1078. Morgan Kaufmann,
[58] Sholom M. Weiss and Nitin Indurkhya. Rule-based machine learning methods for functional prediction. J. Artif. Intell. Res., 3:383–403, 1995. 3
[59] Andrew P. Witkin. Scale space filtering. Proc. 8th International Joint Conference on Artificial Intelligence, pages 1019–1022,
[60] Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, and Yue Cao. Revealing the dark secrets of masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14475–14485, 2023. 2
[61] Jiahuan Yu, Jiahao Chang, Jianfeng He, Tianzhu Zhang, Jiyang Yu, and Wu Feng. ASTR: Adaptive spot-guided transformer for consistent local feature matching. In The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023. 7
[62] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. In International Conference on Learning Representations, 2022. 1, 3
[63] Shengjie Zhu and Xiaoming Liu. PMatch: Paired masked image modeling for dense geometric matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 3, 7

RoMa: Robust Dense Feature Matching
Supplementary Material

In this supplementary material, we provide further details and qualitative examples that could n...
