Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

Jan Boehm; Jiaming Zhang; June Moh Goo; Junwei Zheng; Rainer Stiefelhagen; Weijia Fan; Zichao Zeng

arXiv: 2605.20551 · v1 · pith:BBPBQ3PCnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.RO

Zichao Zeng, June Moh Goo, Junwei Zheng, Weijia Fan, Jiaming Zhang, Rainer Stiefelhagen, Jan Boehm

Pith reviewed 2026-05-21 06:21 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO

keywords visual place recognition · vision transformers · token pruning · weighted aggregation · self-distillation · global descriptors · efficiency trade-off

The pith

Weighted cluster aggregation and inference-time token pruning let visual place recognition models trade accuracy for speed after one training run.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes two linked techniques for vision-transformer backbones in visual place recognition. It first replaces uniform pooling of patch clusters with learned weights that give more influence to clusters carrying stronger place-specific signals. It then uses the resulting importance scores to train a lightweight pruning head attached early in the transformer via self-distillation, so that at inference users can drop any fraction of tokens without retraining and still retrieve the correct place. The combined approach therefore improves descriptor quality while turning the usual fixed cost of feature extraction into a controllable parameter.

Core claim

Assigning weights to clusters during aggregation yields more discriminative global descriptors for VPR, and the same importance information can supervise a pruning module that supports plug-and-play token reduction at inference after a single joint training phase, outperforming token-pruning techniques transferred from general vision tasks.

What carries the argument

The Weighted Aggregated Descriptor (WeiAD) that multiplies cluster contributions by learned weights, together with the WeiToP self-distillation pipeline that transfers aggregation-derived token importance to an early-layer pruning module.
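
The weighting idea can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: it assumes a soft token-to-cluster assignment matrix `assign` is already available (for example from an optimal-transport step), and the function name `weighted_aggregate`, the vector `cluster_weights`, and the final L2 normalization are all assumptions made for the example.

```python
import math

def weighted_aggregate(tokens, assign, cluster_weights):
    """Pool N token features into M weighted cluster descriptors.

    tokens:          N lists of D floats (patch features)
    assign:          N lists of M floats (soft token-to-cluster assignment)
    cluster_weights: M floats (learned in the paper; fixed here for illustration)
    Returns a flat, L2-normalized descriptor of length M * D.
    """
    n, d, m = len(tokens), len(tokens[0]), len(cluster_weights)
    descriptor = []
    for j in range(m):
        # Sum of token features, each scaled by its assignment to cluster j.
        pooled = [sum(assign[i][j] * tokens[i][k] for i in range(n)) for k in range(d)]
        # Scale the whole cluster by its weight; uniform pooling is the case w_j = 1.
        descriptor.extend(cluster_weights[j] * v for v in pooled)
    norm = math.sqrt(sum(v * v for v in descriptor)) or 1.0
    return [v / norm for v in descriptor]

# Toy input: 3 tokens, 2 feature dimensions, 2 clusters.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
assign = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
uniform = weighted_aggregate(tokens, assign, [1.0, 1.0])
weighted = weighted_aggregate(tokens, assign, [2.0, 0.5])
```

With the second call, the first cluster takes a larger share of the normalized descriptor than under uniform weights, which is the effect the weighting is meant to have.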

If this is right

Global descriptors become more discriminative because clusters that matter more for place identity receive higher weight.
Feature extraction cost can be reduced on demand at inference without retraining or separate models for each speed target.
The accuracy-efficiency curve can be adjusted continuously by choosing how many tokens to keep.
VPR-specific pruning outperforms general-purpose token pruning methods when both are applied to the same backbone.
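
The "choose how many tokens to keep" point above can be sketched as a top-k selection over per-token importance scores. The function name `keep_top_tokens`, the `retention` argument, and the rounding rule are assumptions for illustration; the paper's pruning module and its score definition may differ.

```python
def keep_top_tokens(tokens, scores, retention):
    """Keep the highest-scoring fraction of tokens, preserving their original order.

    tokens:    list of token features (any objects)
    scores:    one importance score per token (higher means more important)
    retention: fraction in (0, 1] of tokens to keep
    """
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    # Assumed rounding rule: nearest integer, but never fewer than one token.
    k = max(1, round(retention * len(tokens)))
    # Indices of the k largest scores, then restored to spatial order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)
    return [tokens[i] for i in kept], kept

# Hypothetical scores for six patch tokens.
tokens = ["sky", "facade", "road", "sign", "tree", "car"]
scores = [0.05, 0.90, 0.20, 0.80, 0.40, 0.10]
kept_tokens, kept_idx = keep_top_tokens(tokens, scores, retention=0.5)
```

Because the ratio is only an argument at call time, the same scores serve any speed target, which mirrors the no-retraining claim.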

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same weighting-plus-pruning pattern could be tested on other retrieval problems that rely on transformer patch tokens, such as landmark or product search.
Pairing the pruned descriptors with existing compression techniques would further cut storage and search time for city-scale databases.
Running the method on sequences with strong seasonal or illumination change would reveal whether the learned weights remain stable across domain shifts.

Load-bearing premise

Importance scores produced by the weighted aggregation step remain reliable enough to supervise pruning so that accuracy stays acceptable across different pruning ratios without any further training.

What would settle it

A measurement showing that top-1 retrieval accuracy (Recall@1) on a standard VPR benchmark such as Oxford RobotCar falls more than five percent below the unpruned baseline once half the tokens are removed.
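
One way to run this check is to compute Recall@1 for unpruned and pruned descriptors and compare the drop against the threshold. The sketch below assumes L2-normalized descriptors, dot-product similarity, a set of correct database indices per query, and an absolute reading of "five percent"; `recall_at_1`, `exceeds_drop`, and the toy data are illustrative and do not reproduce any benchmark's official protocol.

```python
def recall_at_1(queries, database, positives):
    """Fraction of queries whose nearest database descriptor is a correct match.

    queries, database: lists of equal-length, L2-normalized descriptors
    positives:         positives[q] is the set of correct database indices for query q
    """
    hits = 0
    for q, query in enumerate(queries):
        sims = [sum(a * b for a, b in zip(query, ref)) for ref in database]
        best = max(range(len(database)), key=lambda i: sims[i])
        if best in positives[q]:
            hits += 1
    return hits / len(queries)

def exceeds_drop(baseline, pruned, max_drop=0.05):
    """True if Recall@1 fell by more than max_drop (absolute) after pruning."""
    return (baseline - pruned) > max_drop

# Toy data: three database descriptors, two queries.
database = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
positives = [{0}, {1}]
full_queries = [[1.0, 0.0], [0.0, 1.0]]     # descriptors from the unpruned model
pruned_queries = [[0.6, 0.8], [0.0, 1.0]]   # hypothetical descriptors after pruning

r_full = recall_at_1(full_queries, database, positives)
r_pruned = recall_at_1(pruned_queries, database, positives)
failed = exceeds_drop(r_full, r_pruned)
```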

Figures

Figures reproduced from arXiv: 2605.20551 by Jan Boehm, Jiaming Zhang, June Moh Goo, Junwei Zheng, Rainer Stiefelhagen, Weijia Fan, Zichao Zeng.

**Figure 2.** (a) Cluster-to-patch transport heatmaps showing distinct assignment patterns. (b) Token pruning illustration, where squares denote patch tokens and blank ones are pruned redundant tokens.

**Figure 1.** Star-shaped markers correspond to WeiAD-based models. The solid red star denotes base WeiAD. The yellow line shows our VPR-specific token pruning approach WeiToP integrated with WeiAD across different retention ratios. Other lines indicate WeiAD equipped with different generic token pruning strategies. Single markers show competing VPR methods.

**Figure 3.** The unified framework of WeiAD and WeiToP. At the training stage, we fine-tune the late layers of DINOv2 ViT-B on GSV-Cities, alongside the score projection, dimension reduction module, WeiToP module, and weight parameters (fire icon).

**Figure 4.** We activate WeiToP after an early layer during inference. Input tokens undergo WeiToP processing to obtain importance logits, which are then combined with token norms to compute importance scores. The top-α selected tokens are retained and fed into the subsequent blocks.

**Figure 5.** Comparison of WeiToP (star lines) with other token pruning methods at different retention ratios ρ.

**Figure 6.** Efficiency-accuracy trade-off performance of WeiAD + WeiToP (star lines) with different model sizes of DINOv2.

**Figure 7.** Visual examples under different conditions. (a) Visualization of the transport mass each token absorbed in WeiAD, i.e., $\sum_{j=1}^{M} w_{\tau(j)} P^{\star}_{ij}$, compared with $\sum_{j=1}^{M} P^{\star}_{ij}$ in SALAD. (b) Tokens retained after applying WeiToP with retention ratios ρ = 0.95 and 0.5, compared with ToFu, FastV, DynamicViT, and ToMe at approximately ρ = 0.5. Blank: removed tokens; orange: merged tokens.

**Figure 8.** Efficiency-accuracy trade-off of the WeiToP pruning module located after different layers.

**Figure 9.** Cluster-to-patch transport heatmaps showing that spatial and semantic patterns of different clusters are distinct.

**Figure 10.** Efficiency-accuracy trade-off of the WeiToP pruning module located after different layers of ViT and the initial tokenizer.

**Figure 11.** Efficiency-accuracy trade-off with different balancing coefficient κ for token-level importance scores. Plot axes: κ on a logarithmic scale against Recall@1 (%); the best-performing value is marked at 0.1.

**Figure 12.** Performance of WeiAD under different magnitudes of γ during joint learning.

**Figure 13.** Visual examples across different cities under different conditions. Heatmap of token importance scores and tokens retained after applying WeiToP with retention ratios ρ = 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, and 0.4. Blank indicates removed tokens.

Original abstract

Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper adds weighted cluster aggregation for better VPR descriptors and a self-distillation pruning scheme that supports inference-time speed adjustments after one training run.

The main takeaway is that they replace uniform pooling with learned weights on clusters to make global descriptors more discriminative, and they add a pruning module trained via self-distillation from those same weights so you can drop tokens early in the ViT at test time without retraining for each speed target. Both pieces are presented as VPR-specific fixes rather than direct copies of general vision techniques.

The work does a solid job calling out that feature extraction cost in ViTs is the real bottleneck for edge deployment, not just final descriptor size, and it tries to solve the accuracy-efficiency trade-off in one joint training pass. If the experiments show consistent recall improvements over uniform aggregation and better pruning curves than off-the-shelf methods on standard VPR benchmarks with viewpoint and seasonal shifts, that is practical progress.

The soft spot is the transfer assumption behind WeiToP: using final aggregation importance to supervise a lightweight head attached after an early transformer block. Early layers mostly see local edges and textures, while the aggregation sees the full set of tokens, so the supervision signal could latch onto patterns that do not matter for robustness. The stress-test note flags this risk correctly. If the paper includes ablations that measure how pruning affects performance across illumination and seasonal variants, and if the gains hold without large drops, the concern is contained; otherwise it remains the load-bearing question.

This is aimed at people building real-time VPR for robotics or mobile platforms who already use ViT backbones and need tunable latency. A reader who cares about concrete efficiency tricks on retrieval tasks will get value from the implementation details and numbers. It deserves a serious referee because the proposals are testable and address a deployment constraint that matters. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The paper proposes two modules for ViT-based Visual Place Recognition: WeiAD, which learns to weight clusters during aggregation to produce more discriminative global descriptors than uniform pooling, and WeiToP, a self-distillation framework that transfers token importance scores derived from the final weighted aggregation to train a lightweight pruning head attached after an early transformer block. After one joint training run, WeiToP permits inference-time token pruning at arbitrary ratios without retraining, aiming to improve the accuracy-efficiency trade-off over both standard VPR pipelines and token-pruning methods transferred from general vision tasks.

Significance. If the empirical claims hold, the work would offer a practical way to obtain stronger global descriptors while simultaneously reducing the dominant cost of ViT feature extraction in large-scale VPR. The plug-and-play character of WeiToP after a single training phase is a notable engineering contribution for edge deployment. However, the significance is tempered by the absence of any quantitative results, ablation tables, or error analysis in the provided abstract; the central claims therefore remain unverified at this stage.

major comments (2)

[WeiToP framework] WeiToP description: the core assumption that final-layer aggregation weights can reliably supervise a pruning module attached to an early transformer block is load-bearing for the 'single-training, plug-and-play' claim. Early blocks primarily encode local texture and edges, while aggregation operates on the final token set; without reported layer-wise correlation statistics or an ablation that measures VPR recall degradation when early-layer importance is used, it is unclear whether the self-distillation objective aligns on VPR-critical structure or on spurious correlations.
[Experiments] Experimental section: the abstract states that WeiToP 'outperforms existing token pruning methods adapted from general vision tasks,' yet no recall@N, latency, or FLOPs numbers, no baseline descriptions, and no ablation on pruning ratios are supplied. Because the soundness of the accuracy-efficiency curves cannot be assessed, the headline claim that flexible control is achieved without per-ratio retraining remains unverified.
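
The layer-wise correlation statistic requested in the first major comment could be measured as a Spearman rank correlation between early-layer importance scores and aggregation-derived ones for the same image. This is a sketch of one possible measurement, not an analysis reported in the paper; the two score vectors are made up, tied values receive average ranks, and constant inputs (zero variance) are not handled.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

early_scores = [0.10, 0.70, 0.30, 0.90, 0.20]  # hypothetical pruning-head output
final_scores = [0.05, 0.60, 0.40, 0.95, 0.10]  # hypothetical aggregation-derived importance
rank_corr = spearman(early_scores, final_scores)
```

A value near 1 would indicate that the early head ranks tokens the way the final aggregation does; values near 0 would support the referee's concern.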

minor comments (2)

[Abstract] The abstract is unusually long and contains several compound claims; a shorter, more focused abstract would improve readability.
[Method] Notation for the weighting function in WeiAD and the importance-score head in WeiToP should be introduced with explicit equations rather than prose descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive feedback. We address each major comment below, providing clarifications and indicating revisions to the manuscript where appropriate.

Referee: [WeiToP framework] WeiToP description: the core assumption that final-layer aggregation weights can reliably supervise a pruning module attached to an early transformer block is load-bearing for the 'single-training, plug-and-play' claim. Early blocks primarily encode local texture and edges, while aggregation operates on the final token set; without reported layer-wise correlation statistics or an ablation that measures VPR recall degradation when early-layer importance is used, it is unclear whether the self-distillation objective aligns on VPR-critical structure or on spurious correlations.

Authors: We appreciate the referee pointing out the need for further validation of the self-distillation alignment in WeiToP. While the manuscript describes the framework and its motivation, we acknowledge that explicit layer-wise correlation statistics and a dedicated ablation on recall degradation for early vs. late layer importance were not included. We will add these analyses in the revised version to demonstrate that the transferred importance scores capture VPR-relevant structures rather than spurious correlations. revision: yes
Referee: [Experiments] Experimental section: the abstract states that WeiToP 'outperforms existing token pruning methods adapted from general vision tasks,' yet no recall@N, latency, or FLOPs numbers, no baseline descriptions, and no ablation on pruning ratios are supplied. Because the soundness of the accuracy-efficiency curves cannot be assessed, the headline claim that flexible control is achieved without per-ratio retraining remains unverified.

Authors: The abstract is constrained by length and thus omits specific numerical results, which are presented in detail in the experimental section of the full manuscript, including comparisons with adapted token pruning methods, recall metrics, latency, FLOPs, and ablations across pruning ratios. To address this, we will revise the abstract to include key quantitative findings supporting the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposals are independent architectural modules with external validation

The paper introduces WeiAD as a weighted aggregation module and WeiToP as a self-distillation-based pruning framework supervised by aggregation-derived importance scores. These are presented as novel components trained jointly on VPR tasks, with claims supported by empirical comparisons on standard benchmarks rather than any definitional equivalence or reduction of outputs to fitted inputs from the same data. No equations or steps reduce the reported accuracy-efficiency trade-offs to quantities defined by construction from the inputs; the supervision link is a designed training objective, not a tautology. Self-citations for baselines or prior ViT work are not load-bearing for the central claims, which remain falsifiable via external datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full methods, training details, and experimental setup unavailable. Free parameters likely include cluster weights and pruning thresholds but cannot be enumerated without the manuscript. No invented physical entities; the new modules are algorithmic.

axioms (1)

Domain assumption: ViT patch tokens encode spatial and semantic patterns that can be meaningfully clustered and weighted for place discrimination.
Implicit in the motivation for moving beyond uniform pooling.

pith-pipeline@v0.9.0 · 5817 in / 1285 out tokens · 47805 ms · 2026-05-21T06:21:54.231197+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

self-distillation... $\mathcal{L}_{\text{distill}} = T^{2} \cdot \frac{1}{N_0} \sum_{i} p^{(t)}_{i} \log \frac{p^{(t)}_{i}}{p^{(s)}_{i}}$
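
Read literally, the quoted loss is a temperature-scaled KL divergence from teacher to student, divided by N0. The sketch below assumes that p(t) and p(s) are softmax distributions over the N0 tokens, computed from teacher and student logits at temperature T; that reading, the function names, and the inputs are assumptions, since the quoted passage is elided.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """T^2 * (1/N0) * sum_i p_t[i] * log(p_t[i] / p_s[i])."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    n0 = len(p_t)
    # Terms with zero teacher probability contribute nothing to the sum.
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0.0)
    return temperature ** 2 * kl / n0

teacher = [2.0, 0.5, -1.0, 0.0]  # hypothetical aggregation-derived importance logits
same = distill_loss(teacher, teacher)
different = distill_loss(teacher, [0.0, 0.0, 0.0, 0.0])
```

The loss is zero when the student reproduces the teacher distribution and positive otherwise; the T² factor is the usual rescaling that keeps gradient magnitudes comparable across temperatures.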

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 5 internal anchors

[1] Optimal transport aggregation for visual place recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[2] Rethinking visual geo-localization for large-scale applications. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[3] Eigenplaces: Training viewpoint robust models for visual place recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[4] NetVLAD: CNN architecture for weakly supervised place recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[5] Mixvpr: Feature mixing for visual place recognition. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
[6] Cricavpr: Cross-image correlation-aware representation learning for visual place recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[7] Johnson, Jeff and Douze, Matthijs and J. Billion-scale similarity search with. IEEE Transactions on Big Data, 2019.
[8] Anyloc: Towards universal visual place recognition. IEEE Robotics and Automation Letters.
[9] Revisit anything: Visual place recognition via image segment retrieval. European Conference on Computer Vision, 2024.
[10] R2former: Unified retrieval and reranking transformer for place recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[11] SuperVLAD: Compact and robust image descriptors for visual place recognition. Advances in Neural Information Processing Systems.
[12] Towards seamless adaptation of pre-trained models for visual place recognition. arXiv preprint arXiv:2402.14505.
[13] VLAD-BuFF: burst-aware fast feature aggregation for visual place recognition. European Conference on Computer Vision, 2024.
[14] Dilated Superpixel Aggregation for Visual Place Recognition. IEEE Robotics and Automation Letters, 2026.
[15] Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.
[16] DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
[17] Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/aaai.v39i21.34366
[18] Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems.
[19] Adavit: Adaptive vision transformers for efficient image recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[20] Gsv-cities: Toward appropriate supervised visual place recognition. Neurocomputing, 2022.
[21] Mapillary street-level sequences: A dataset for lifelong place recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[22] Learning context flexible attention model for long-term visual place recognition. IEEE Robotics and Automation Letters, 2018.
[23] Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons. Proc. of Workshop on Long-Term Autonomy, IEEE International Conference on Robotics and Automation (ICRA), 2013.
[24] Amstertime: A visual place recognition benchmark dataset for severe domain shift. 2022 26th International Conference on Pattern Recognition (ICPR), 2022.
[25] Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[26] Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems.
[27] Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 1967.
[28] Superglue: Learning feature matching with graph neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[29] Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[30] Visual navigation using view-sequenced route representation. Proceedings of IEEE International Conference on Robotics and Automation, 1996.
[31] VPAIR--Aerial Visual Place Recognition and Localization in Large-scale Outdoor Environments. arXiv preprint arXiv:2205.11567.
[32] Capturing, reconstructing, and simulating: the urbanscene3d dataset. European Conference on Computer Vision, 2022.
[33] Scalable recognition with a vocabulary tree. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006.
[34] Persistent navigation and mapping using a biologically inspired SLAM system. The International Journal of Robotics Research, 2010.
[35] ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
[36] Planet-photo geolocation with convolutional neural networks. European Conference on Computer Vision, 2016.
[37] Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984.
[38] An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. European Conference on Computer Vision, 2024.
[39] Multi-similarity loss with general pair weighting for deep metric learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[40] Dinov3. arXiv preprint arXiv:2508.10104.
[41]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

International Conference on Learning Representations , year=

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations , author=. International Conference on Learning Representations , year=

work page
[43]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Token fusion: Bridging the gap between token pruning and token merging , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

work page
[44]

Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining , pages=

Learned token pruning for transformers , author=. Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining , pages=

work page
[45]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Dynamic token pruning in plain vision transformers for semantic segmentation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[46]

Token Merging: Your ViT But Faster

Token merging: Your vit but faster , author=. arXiv preprint arXiv:2210.09461 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[47] Zero-TPrune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48] Adaptive token sampling for efficient vision transformers. European Conference on Computer Vision, 2022.

[49] RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization. arXiv preprint arXiv:2603.27758.

[50] TransGeo: Transformer is all you need for cross-view image geo-localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

