Who Handles Orientation? Investigating Invariance in Feature Matching
Pith reviewed 2026-05-10 15:01 UTC · model grok-4.3
The pith
Incorporating rotation invariance in the descriptor achieves similar performance to handling it in the matcher but enables faster matchers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Incorporating rotation invariance already in the descriptor yields similar performance to handling it in the matcher. However, when invariance is learned in the descriptor, it emerges at an earlier layer of the matcher, allowing for a faster rotation-invariant matcher. Enforcing rotation invariance does not hurt upright performance when trained at scale, and increasing the training data size substantially improves generalization to rotated images.
What carries the argument
The stage at which rotation invariance is incorporated, either in the keypoint descriptor or the subsequent feature matcher, within a modern sparse matching pipeline.
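The "matcher" stage referred to here ranges from simple nearest-neighbor assignment to a learned attention network. As a point of reference (not the paper's learned matcher), the simplest possible matcher, mutual nearest neighbors over descriptor similarity, can be sketched as:

```python
import numpy as np

def mutual_nearest_neighbors(desc_a, desc_b):
    """Minimal 'matcher' stage: mutual nearest neighbors under cosine
    similarity. Learned matchers (SuperGlue, LightGlue) replace this with
    attention, but the interface -- descriptors in, correspondences out --
    is the same."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T
    nn_ab = sim.argmax(axis=1)   # best B index for each A descriptor
    nn_ba = sim.argmax(axis=0)   # best A index for each B descriptor
    idx_a = np.arange(len(a))
    mutual = nn_ba[nn_ab] == idx_a  # keep only reciprocal best matches
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)

# Three orthonormal descriptors, permuted in the second image:
matches = mutual_nearest_neighbors(np.eye(3), np.eye(3)[[2, 0, 1]])
```

If the descriptors are already rotation invariant, this cheap matcher suffices under rotation; if not, the matcher itself must absorb the rotation, which is where the extra cost comes from.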
If this is right
- Rotation invariance can be learned at the descriptor stage with matching accuracy comparable to later incorporation in the matcher.
- Early invariance in the descriptor supports simpler and faster matcher designs.
- Large-scale training with augmentation maintains performance on non-rotated images.
- Increasing training data size improves generalization to rotated views across benchmarks.
Where Pith is reading between the lines
- Pipeline designers could prioritize early-stage invariance to reduce computational cost in rotation-robust systems.
- The benefits of scale suggest that broader 3D data collection may enhance robustness to other transformations without new model constraints.
- Similar stage-comparison experiments could be applied to scale invariance or illumination changes in matching pipelines.
Load-bearing premise
The chosen collection of 3D vision datasets together with the data augmentation strategy will produce rotation invariance that generalizes to held-out evaluation benchmarks without dataset-specific overfitting.
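The augmentation strategy this premise rests on can be sketched as follows. The helper below is hypothetical (not the paper's training code): it rotates an image by a multiple of 90° and transforms keypoint (x, y) coordinates accordingly, so that ground-truth correspondences stay consistent after augmentation.

```python
import numpy as np

def rotate_pair(image, keypoints_xy, k=1):
    """Hypothetical augmentation helper: rotate an HxW image by k*90
    degrees CCW (np.rot90 convention) and map keypoint (x, y) pixel
    coordinates to their new locations."""
    h, w = image.shape[:2]
    rotated = np.rot90(image, k)
    x, y = keypoints_xy[:, 0], keypoints_xy[:, 1]
    if k % 4 == 1:    # 90 deg CCW: (x, y) -> (y, w - 1 - x)
        new = np.stack([y, w - 1 - x], axis=1)
    elif k % 4 == 2:  # 180 deg: (x, y) -> (w - 1 - x, h - 1 - y)
        new = np.stack([w - 1 - x, h - 1 - y], axis=1)
    elif k % 4 == 3:  # 270 deg CCW: (x, y) -> (h - 1 - y, x)
        new = np.stack([h - 1 - y, x], axis=1)
    else:
        new = keypoints_xy.copy()
    return rotated, new

img = np.arange(6).reshape(2, 3)      # toy 2x3 "image"
kps = np.array([[2, 0]])              # keypoint at x=2, y=0 (value 2)
rot, new_kps = rotate_pair(img, kps, k=1)
```

Arbitrary-angle augmentation additionally requires interpolation and boundary handling, which is why the generalization of the learned invariance to unseen angles is the load-bearing question.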
What would settle it
A large performance drop on a new benchmark containing rotation angles or image types outside the training augmentation distribution would challenge the generalization of the learned invariance.
Original abstract
Finding matching keypoints between images is a core problem in 3D computer vision. However, modern matchers struggle with large in-plane rotations. A straightforward mitigation is to learn rotation invariance via data augmentation. However, it remains unclear at which stage rotation invariance should be incorporated. In this paper, we study this in the context of a modern sparse matching pipeline. We perform extensive experiments by training on a large collection of 3D vision datasets and evaluating on popular image matching benchmarks. Surprisingly, we find that incorporating rotation invariance already in the descriptor yields similar performance to handling it in the matcher. However, rotation invariance is achieved earlier in the matcher when it is learned in the descriptor, allowing for a faster rotation-invariant matcher. Further, we find that enforcing rotation invariance does not hurt upright performance when trained at scale. Finally, we study the emergence of rotation invariance through scale and find that increasing the training data size substantially improves generalization to rotated images. We release two matchers robust to in-plane rotations that achieve state-of-the-art performance on e.g. multi-modal (WxBS), extreme (HardMatch), and satellite image matching (SatAst). Code is available at https://github.com/davnords/loma.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper investigates the stage at which rotation invariance should be incorporated in modern sparse keypoint matching pipelines for 3D computer vision. By training models on a large aggregated collection of 3D vision datasets using data augmentation and evaluating on held-out standard benchmarks (including multi-modal WxBS, extreme HardMatch, and satellite SatAst), the authors find that descriptor-level invariance yields performance comparable to matcher-level handling. They further report that descriptor-level invariance emerges earlier in the pipeline (enabling lighter matchers), does not degrade upright performance when trained at scale, and improves with larger training sets. Two rotation-robust matchers are released that achieve state-of-the-art results on the cited challenging benchmarks.
Significance. If the central empirical claims hold, the work supplies practical design guidance for rotation-robust matchers that can be both accurate and computationally lighter. The scale of the training regime, controlled ablations isolating descriptor versus matcher invariance, scaling experiments on generalization, and public code release constitute clear strengths that support reproducibility and adoption in 3D vision pipelines. The results also provide evidence that descriptor-level invariance can be sufficient without sacrificing upright performance.
Minor comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the manuscript refers to 'a large collection of 3D vision datasets' without enumerating them or their relative sizes; a brief table or list would help readers evaluate the training distribution and potential domain coverage.
- [§4.2] §4.2 and training details: while the released code enables verification, the paper provides limited explicit description of exact training schedules, optimizer settings, and hyperparameter choices for the descriptor-only versus matcher-augmented variants; adding a concise summary table would improve clarity without altering the claims.
- [§5] Figure captions and §5 (Results): several plots compare rotated versus upright performance; ensuring all axis labels explicitly state the rotation range (e.g., 0–180°) and the exact metric (e.g., AUC@5°) would prevent minor misinterpretation.
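For reference, the AUC@t metric mentioned above is conventionally computed as the area under the cumulative pose-error curve, clipped and normalized at each threshold. A minimal sketch, following the common evaluation recipe popularized by SuperGlue-style code (not necessarily this paper's exact script):

```python
import numpy as np

def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Area under the cumulative pose-error curve, normalized per
    threshold (in degrees)."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))   # curve starts at (0, 0)
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)          # errors below t
        e = np.concatenate((errors[:last], [t]))   # clip curve at t
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        # trapezoidal integration, then normalize so a perfect run is 1.0
        area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)
        aucs.append(area / t)
    return aucs
```

For example, a single pair with 2.5° rotation error yields AUC@5° = 0.75, and any error above the threshold contributes zero.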
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation to accept the manuscript. The summary accurately captures our empirical investigation into the stage at which rotation invariance should be incorporated in sparse keypoint matching pipelines.
Circularity Check
No significant circularity; empirical results are self-contained
Full rationale
The paper reports controlled ablation experiments that train rotation-invariant descriptors and matchers on aggregated 3D vision datasets and evaluate generalization on held-out benchmarks (e.g., WxBS, HardMatch, SatAst). No equations, fitted parameters, or self-citations are used to derive performance claims; all reported outcomes (descriptor vs. matcher invariance, scaling effects, upright preservation) follow directly from the experimental protocol and external test sets. This is the standard non-circular case for large-scale empirical computer-vision work.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Data augmentation can be used to learn rotation invariance in neural network descriptors and matchers for feature matching.
Reference graph
Works this paper leans on
- [1] Manuel López Antequera, Pau Gargallo, Markus Hofinger, Samuel Rota Bulo, Yubin Kuang, and Peter Kontschieder. Mapillary planet-scale depth dataset. In ECCV, 2020.
- [2] Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In ECCV, 2022.
- [3] Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, and Vasileios Balntas. Scenescript: Reconstructing scenes with an autoregressive structured language model. In ECCV, 2025.
- [4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In ECCV, 2006.
- [5] Gabriele Berton, Gabriele Goletto, Gabriele Trivigno, Alex Stoken, Barbara Caputo, and Carlo Masone. Earthmatch: Iterative coregistration for fine-grained localization of astronaut photography. In CVPR, 2024.
- [6] Georg Bökman and Fredrik Kahl. A case for using rotation invariant features in state of the art feature matchers. In CVPRW, 2022.
- [7] Georg Bökman, Johan Edstedt, Michael Felsberg, and Fredrik Kahl. Affine steerers for structured keypoint description. In ECCV, 2024.
- [8] Georg Bökman, Johan Edstedt, Michael Felsberg, and Fredrik Kahl. Steerers: A framework for rotation equivariant keypoint descriptors. In CVPR, 2024.
- [9] Georg Bökman, David Nordström, and Fredrik Kahl. Flopping for flops: Leveraging equivariance for computational efficiency. In ICML, 2025.
- [10] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv preprint arXiv:2001.10773, 2020.
- [11] Taco Cohen and Max Welling. Group equivariant convolutional networks. In ICML. PMLR, 2016.
- [12] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
- [13] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR, 2018.
- [14] Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. Mast3r-sfm: A fully-integrated solution for unconstrained structure-from-motion. In 3DV, 2025.
- [15] Johan Edstedt, Ioannis Athanasiadis, Mårten Wadenbäck, and Michael Felsberg. DKM: Dense kernelized feature matching for geometry estimation. In CVPR, 2023.
- [16] Johan Edstedt, Georg Bökman, and Zhenjun Zhao. Dedode v2: Analyzing and improving the dedode keypoint detector. In CVPRW, 2024.
- [17] Johan Edstedt, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. DeDoDe: Detect, Don't Describe – Describe, Don't Detect for Local Feature Matching. In 3DV.
- [18] Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. RoMa: Robust dense feature matching. In CVPR, 2024.
- [19] Johan Edstedt, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. Dad: Distilled reinforcement learning for diverse keypoint detection. arXiv preprint arXiv:2503.07347, 2025.
- [20] Johan Edstedt, David Nordström, Yushan Zhang, Georg Bökman, Jonathan Astermark, Viktor Larsson, Anders Heyden, Fredrik Kahl, Mårten Wadenbäck, and Michael Felsberg. Roma v2: Harder better faster denser feature matching. arXiv preprint arXiv:2511.15706, 2025.
- [21] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
- [22] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press.
- [23] Hanwen Jiang, Zexiang Xu, Desai Xie, Ziwen Chen, Haian Jin, Fujun Luan, Zhixin Shu, Kai Zhang, Sai Bi, Xin Sun, Jiuxiang Gu, Qixing Huang, Georgios Pavlakos, and Hao Tan. Megasynth: Scaling up 3d scene reconstruction with synthesized data. In CVPR, 2025.
- [24] Jongmin Lee, Yoonwoo Jeong, and Minsu Cho. Self-supervised learning of image scale and orientation. In BMVC, 2021.
- [25] Jongmin Lee, Byungjin Kim, and Minsu Cho. Self-supervised equivariant learning for oriented keypoint detection. In CVPR, 2022.
- [26] Jongmin Lee, Byungjin Kim, Seungwook Kim, and Minsu Cho. Learning rotation-equivariant features for visual correspondence. In CVPR, 2023.
- [27] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.
- [28] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In ICCV, 2023.
- [29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
- [30] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
- [31] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
- [32] Dmytro Mishkin, Jiri Matas, Michal Perdoch, and Karel Lenc. WxBS: Wide baseline stereo generalizations. In BMVC, 2015.
- [33] Dmytro Mishkin, Filip Radenović, and Jiří Matas. Repeatability Is Not Enough: Learning Affine Regions via Discriminability. In ECCV, 2018.
- [34] David Nordström, Johan Edstedt, Fredrik Kahl, and Georg Bökman. Stronger vits with octic equivariance. arXiv preprint arXiv:2505.15441, 2025.
- [35] David Nordström, Johan Edstedt, Georg Bökman, Jonathan Astermark, Anders Heyden, Viktor Larsson, Mårten Wadenbäck, Michael Felsberg, and Fredrik Kahl. Loma: Local feature matching revisited, 2026.
- [36] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021.
- [37] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021.
- [38] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
- [39] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In CVPR, 2020.
- [40] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, ... arXiv, 2025.
- [41] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In CVPR, 2021.
- [42] Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger. Smd-nets: Stereo mixture density networks. In CVPR, 2021.
- [43] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. In ECCV, pages 516–533, 2022.
- [44] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. In ECCV, 2024.
- [45] Yannick Verdie, Kwang Yi, Pascal Fua, and Vincent Lepetit. Tilde: A temporally invariant learned detector. In CVPR.
- [46] Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In CVPR, 2025.
- [47] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025.
- [48] Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations. arXiv preprint arXiv:2509.09676.
- [49] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In IROS, 2020.
- [50] Maurice Weiler, Patrick Forré, Erik Verlinde, and Max Welling. Equivariant and Coordinate Independent Convolutional Networks. 2023.
- [51] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.
- [52] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In ICCV, 2023.
- [53] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In ECCV.
- [54] Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen, Yuheng Qiu, Jay Karhade, Shreyas Jha, Yaoyu Hu, Deva Ramanan, Sebastian Scherer, and Wenshan Wang. Ufm: A simple path towards unified dense correspondence with flow. In NeurIPS, 2025.
- [55] Xiaoming Zhao, Xingming Wu, Weihai Chen, Peter C. Y. Chen, Qingsong Xu, and Zhengguo Li. Aliked: A lighter keypoint and descriptor extraction network via deformable transformation. IEEE TIP, 72, 2023.