arxiv: 2604.00809 · v2 · submitted 2026-04-01 · 💻 cs.CV · cs.HC· cs.IR

Recognition: no theorem link

Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers

Kawtar Zaher , Olivier Buisson , Alexis Joly

Authors on Pith no claims yet

Pith reviewed 2026-05-13 23:00 UTC · model grok-4.3

classification 💻 cs.CV cs.HCcs.IR

keywords human-in-the-loopobject retrievalactive learningvision transformersrelevance feedbackmulti-object scenesimage retrievalcluttered images

0 comments

The pith

Pre-trained vision transformers support effective active learning for retrieving specific object classes in cluttered multi-object images through targeted representation choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper revisits human-in-the-loop object retrieval, where a system starts from a large unlabeled collection and iteratively finds images containing a user-specified object class using only an initial query and relevance feedback. The process is cast as binary classification that improves over rounds via an active learning loop selecting the most informative samples for user annotation. In multi-object cluttered scenes where objects occupy small regions, the work tests different ways to derive localized descriptors from pre-trained ViTs instead of relying on global image features. A reader would care because this setup shows how to build practical interactive search tools that avoid heavy supervision or model retraining while handling real-world image complexity.

Core claim

Pre-trained ViT representations, when paired with appropriate choices for which object instances to consider in an image, the form of user annotations, the active selection strategy, and the aggregation of local versus global features, allow the binary classifier to distinguish relevant images effectively even in complex multi-object datasets, delivering concrete design guidelines for active-learning retrieval pipelines.

What carries the argument

Representation strategies derived from pre-trained Vision Transformers inside an active-learning binary classification loop that selects samples for relevance feedback.

If this is right

Trade-offs arise between capturing global scene context and focusing on fine-grained local object details depending on the chosen representation strategy.
Active selection of samples for annotation measurably refines the classifier's ability to separate relevant from irrelevant images over iterations.
These design choices together produce retrieval pipelines that work on cluttered scenes without requiring model fine-tuning.
Practical guidelines emerge for annotation format and instance handling that balance user effort against performance gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same representation adaptation pattern could apply to interactive retrieval in video or multi-modal collections where localization matters.
Lower annotation budgets might become feasible for users searching large unlabeled archives for uncommon object categories.
Pre-trained models may prove sufficient for many interactive vision tasks if similar lightweight design choices are explored.

Load-bearing premise

That pre-trained ViT representations can be adapted via simple design choices to capture localized object features sufficiently well in multi-object cluttered scenes without task-specific fine-tuning or additional supervision.

What would settle it

An experiment on multi-object datasets showing that none of the tested ViT representation strategies improve retrieval precision over standard global-feature baselines after multiple rounds of active learning and user feedback.

Figures

Figures reproduced from arXiv: 2604.00809 by Alexis Joly, Kawtar Zaher, Olivier Buisson.

**Figure 1.** Figure 1: Retrieved images during earlier iterations for two initial queries ("dog", "fire hydrant"): the global descriptor captures the dominating overall concept("cooking", "road"), the local one recognizes the parts only ("furry animal", "cylindric object"), whereas the hybrid one allows to capture the desired object. Human-in-the-Loop OR addresses this issue by incorporating a Human-in-theLoop’s [22] feedback a… view at source ↗

**Figure 2.** Figure 2: Visual representation of the Human-in-the-Loop Object Retrieval framework. To make the most of the limited feedback provided by the user, the retrieval stage leverages Active Learning (AL) strategies. Since the user can label only a small number of samples per iteration, selecting the most informative samples is crucial. AL allows the system to prioritize examples that are expected to most improve the cla… view at source ↗

**Figure 3.** Figure 3: Iterative performance on Coco2017 [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Iterative performance on PascalVOC2012. The case of Vision Transformers with Registers. Register-based ViTs [8], as an extension of standard ViTs, have been shown to improve dense prediction tasks such as Object Discovery. As our setup share some characteristics with dense tasks, we investigate the use of such models. We use the same architecture and unsupervised training framework to compare. We report t… view at source ↗

read the original abstract

Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user's Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns to distinguish between relevant and non-relevant images to the query, through iterative user interaction. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of the image within a complex, cluttered scene. Unlike object-centered settings where global descriptors often suffice, multi-object images require more adapted, localized descriptors. In this work, we formulate and revisit the Human-in-the-Loop Object Retrieval task by leveraging pre-trained ViT representations, and addressing key design questions, including which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object's features. We compare several representation strategies across multi-object datasets highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Pre-trained ViTs get plugged into active learning for object retrieval with some useful design comparisons, but the evidence for simple representation tweaks beating baselines in clutter is still thin.

read the letter

The paper takes pre-trained Vision Transformers and applies them to human-in-the-loop object retrieval, where you start with a query and use active learning plus user feedback to pull out images containing a target class from a big unlabeled pool. They focus on multi-object cluttered scenes and run comparisons across representation choices like class token, patch averaging, and attention maps, plus decisions on annotation format and sample selection. That produces some concrete trade-off notes between global context and local object focus, which could help people tuning annotation pipelines without full retraining.

Referee Report

3 major / 1 minor

Summary. The paper revisits Human-in-the-Loop Object Retrieval as an active-learning binary classification task that iteratively retrieves images of a user-specified object class from a large unlabeled collection using only an initial query and relevance feedback. It focuses on multi-object cluttered scenes where the target may occupy a small image region, and compares pre-trained ViT representation strategies (class token, patch averaging, attention maps) to address design choices on object instances, annotation forms, active selection, and feature localization, ultimately offering practical insights for interactive retrieval pipelines.

Significance. If the empirical comparisons and trade-offs hold, the work supplies actionable guidance on adapting frozen pre-trained ViTs for localized object retrieval without task-specific fine-tuning or extra supervision, which could streamline active-learning pipelines in computer vision applications involving cluttered scenes.

major comments (3)

Abstract: The description of comparisons across representation strategies provides no quantitative results, error bars, dataset details, or evaluation protocol, preventing verification of whether the central claim that simple ViT design choices suffice for localized features is supported.
§3 (Representation strategies): The claim that pre-trained ViT token strategies capture localized object features in multi-object scenes relies on the untested assumption that global ImageNet supervision yields sufficient locality for targets occupying <10% of the image; no ablation against a frozen CNN baseline under identical active-learning selection is reported, leaving open whether gains arise from the ViT or the AL loop.
§4 (Experiments): Absence of concrete metrics, dataset statistics, or protocol details (e.g., how active selection is applied across representation variants) is load-bearing for the asserted trade-offs between global context and fine-grained local details.

minor comments (1)

Introduction: The list of addressed design questions would benefit from an explicit enumeration or table to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work revisiting Human-in-the-Loop Object Retrieval with pre-trained ViTs. We address each major comment below and indicate revisions to strengthen the manuscript.

read point-by-point responses

Referee: Abstract: The description of comparisons across representation strategies provides no quantitative results, error bars, dataset details, or evaluation protocol, preventing verification of whether the central claim that simple ViT design choices suffice for localized features is supported.

Authors: We agree the abstract is high-level. The full paper reports quantitative results (e.g., precision@10 and mAP with error bars over 5 runs) on multi-object datasets like COCO subsets and OpenImages, with the protocol in §4.1. We will revise the abstract to include key quantitative highlights, dataset names, and a brief protocol reference. revision: yes
Referee: §3 (Representation strategies): The claim that pre-trained ViT token strategies capture localized object features in multi-object scenes relies on the untested assumption that global ImageNet supervision yields sufficient locality for targets occupying <10% of the image; no ablation against a frozen CNN baseline under identical active-learning selection is reported, leaving open whether gains arise from the ViT or the AL loop.

Authors: The section isolates the effect of ViT representation choices (class token vs. patch averaging vs. attention maps) under a fixed AL loop to focus on design trade-offs for localization in cluttered scenes. We acknowledge the value of a CNN baseline for broader context. We will add a frozen ResNet-50 ablation using the identical active-learning selection and annotation protocol in the revised experiments. revision: yes
Referee: §4 (Experiments): Absence of concrete metrics, dataset statistics, or protocol details (e.g., how active selection is applied across representation variants) is load-bearing for the asserted trade-offs between global context and fine-grained local details.

Authors: Section 4 reports concrete metrics (mAP, precision-recall), dataset statistics (image counts, object area ratios <10%), and applies the same uncertainty-based active selection uniformly across all representation variants (detailed in §4.2–4.3). We will add a summary table of the protocol and cross-references to improve clarity without altering the results. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical comparisons are self-contained

full rationale

The paper is an empirical study that compares ViT representation strategies (class token, patch averaging, attention maps) within an active-learning loop for object retrieval on multi-object datasets. No equations, predictions, or derivations are presented that reduce reported results to quantities fitted from the same evaluation data. Design choices are tested via ablation on held-out datasets rather than being defined in terms of the outcomes they produce. Self-citations, if present, are not load-bearing for any central claim.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review performed from abstract only; full text unavailable so ledger entries are inferred at high level from stated assumptions.

free parameters (1)

representation strategy choice
Paper compares several strategies for capturing object features; the selection is a design choice that affects reported performance.

axioms (1)

domain assumption Pre-trained ViT features are sufficiently informative for distinguishing relevant objects in multi-object scenes when combined with active selection
Invoked when the work states that ViT representations address the challenge of localized descriptors in cluttered images.

pith-pipeline@v0.9.0 · 5581 in / 1228 out tokens · 28211 ms · 2026-05-13T23:00:41.875629+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 3 internal anchors

[1]

In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abdali, A., Gripon, V., Drumetz, L., Boguslawski, B.: Active learning for efficient few-shot classification. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)

work page 2023
[2]

Op- timizing active learning for low annotation budgets.arXiv preprint arXiv:2201.07200, 2022

Aggarwal, U., Popescu, A., Hudelot, C.: Optimizing active learning for low anno- tation budgets. arXiv preprint arXiv:2201.07200 (2022)

work page arXiv 2022
[3]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn ar- chitecture for weakly supervised place recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5297–5307 (2016)

work page 2016
[4]

In: 2012 IEEE conference on computer vision and pattern recog- nition

Arandjelović, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: 2012 IEEE conference on computer vision and pattern recog- nition. pp. 2911–2918. IEEE (2012) 14 K. Zaher et al

work page 2012
[5]

In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16

Cao, B., Araujo, A., Sim, J.: Unifying deep local and global features for image search. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. pp. 726–743. Springer (2020)

work page 2020
[6]

IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 7270–7292 (2022)

Chen, W., Liu, Y., Wang, W., Bakker, E.M., Georgiou, T., Fieguth, P., Liu, L., Lew, M.S.: Deep learning for instance retrieval: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 7270–7292 (2022)

work page 2022
[7]

In: Workshop on statistical learning in computer vision, ECCV

Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV. vol. 1, pp. 1–2. Prague (2004)

work page 2004
[8]

Vision Transformers Need Registers

Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need reg- isters. arXiv preprint arXiv:2309.16588 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

IEEE Transactions on Geoscience and Remote Sensing53(5), 2323–2334 (2014)

Demir, B., Bruzzone, L.: A novel active learning method in relevance feedback for content-based remote sensing image retrieval. IEEE Transactions on Geoscience and Remote Sensing53(5), 2323–2334 (2014)

work page 2014
[10]

Computers in Biology and Medicine 196, 110640 (2025)

Denner, S., Zimmerer, D., Bounias, D., Bujotzek, M., Xiao, S., Stock, R., Kausch, L., Schader, P., Penzkofer, T., Jäger, P.F., et al.: Leveraging foundation models for content-based image retrieval in radiology. Computers in Biology and Medicine 196, 110640 (2025)

work page 2025
[11]

Deselaers, T., Hanbury, A., Viitaniemi, V., Benczúr, A., Brendel, M., Daróczy, B., Escalante Balderas, H.J., Gevers, T., Hernández Gracidas, C.A., Hoi, S.C., et al.: Overview of the imageclef 2007 object retrieval task. In: Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Eval- uation Forum, CLEF 2007, Buda...

work page 2007
[12]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010
[13]

arXiv preprint arXiv:2102.05644 (2021)

El-Nouby, A., Neverova, N., Laptev, I., Jégou, H.: Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644 (2021)

work page arXiv 2021
[14]

http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

work page 2012
[15]

IEEE Transactions on Geoscience and Remote Sensing45(4), 818–826 (2007)

Ferecatu, M., Boujemaa, N.: Interactive remote-sensing image retrieval using active relevance feedback. IEEE Transactions on Geoscience and Remote Sensing45(4), 818–826 (2007)

work page 2007
[16]

In: European Conference on Computer Vision

Garg, K., Puligilla, S.S., Kolathaya, S., Krishna, M., Garg, S.: Revisit anything: Visual place recognition via image segment retrieval. In: European Conference on Computer Vision. pp. 326–343. Springer (2024)

work page 2024
[17]

In: 2021 IEEE Intelligent Vehicles Symposium Workshops (IV Workshops)

Govindarajan, H., Lindskog, P., Lundström, D., Olmin, A., Roll, J., Lindsten, F.: Self-supervisedrepresentationlearningforcontentbasedimageretrievalofcomplex scenes. In: 2021 IEEE Intelligent Vehicles Symposium Workshops (IV Workshops). pp. 249–256. IEEE (2021)

work page 2021
[18]

In: 2010 IEEE computer society conference on computer vision and pattern recognition

Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: 2010 IEEE computer society conference on computer vision and pattern recognition. pp. 3304–3311. IEEE (2010)

work page 2010
[19]

In: Computer vision– ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer vision– ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13. pp. 740–755. Springer (2014) Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained ViTs 15

work page 2014
[20]

Litayem, S., Joly, A., Boujemaa, N.: Interactive objects retrieval with efficient boosting.In:Proceedingsofthe17thACMinternationalconferenceonMultimedia. pp. 545–548 (2009)

work page 2009
[21]

Interna- tional journal of computer vision60, 91–110 (2004)

Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna- tional journal of computer vision60, 91–110 (2004)

work page 2004
[22]

Arti- ficial Intelligence Review56(4), 3005–3054 (2023)

Mosqueira-Rey, E., Hernández-Pereira, E., Alonso-Ríos, D., Bobes-Bascarán, J., Fernández-Leal, Á.: Human-in-the-loop machine learning: a state of the art. Arti- ficial Intelligence Review56(4), 3005–3054 (2023)

work page 2023
[23]

International Journal of Electrical and Computer Engineering 6(6), 3238 (2016)

Ngo, G.T., Ngo, T.Q., Nguyen, D.D.: Image retrieval with relevance feedback using svm active learning. International Journal of Electrical and Computer Engineering 6(6), 3238 (2016)

work page 2016
[24]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

arXiv preprint arXiv:1511.08458 (2015)

O’shea, K., Nash, R.: An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458 (2015)

work page arXiv 2015
[26]

Journal of Applied Computer Science & Mathematics (10) (2011)

Patil, P.B., Kokare, M.B.: Relevance feedback in content based image retrieval: A review. Journal of Applied Computer Science & Mathematics (10) (2011)

work page 2011
[27]

In: 2007 IEEE conference on computer vision and pattern recognition

Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image catego- rization. In: 2007 IEEE conference on computer vision and pattern recognition. pp. 1–8. IEEE (2007)

work page 2007
[28]

In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops

Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: Cnn features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 806–813 (2014)

work page 2014
[29]

Settles, B.: Active learning literature survey (2009)

work page 2009
[30]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Shao, S., Chen, K., Karpur, A., Cui, Q., Araujo, A., Cao, B.: Global features are all you need for image retrieval and reranking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11036–11046 (2023)

work page 2023
[31]

In: Proceedings of the IEEE/CVF winter conference on applications of computer vision

Song, C.H., Yoon, J., Choi, S., Avrithis, Y.: Boosting vision transformers for image retrieval. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 107–117 (2023)

work page 2023
[32]

arXiv preprint arXiv:2207.00287 (2022)

Song, Y., Zhu, R., Yang, M., He, D.: Dalg: Deep attentive local and global modeling for image retrieval. arXiv preprint arXiv:2207.00287 (2022)

work page arXiv 2022
[33]

In: proceedings of the IEEE/CVF international conference on computer vision

Tan,F.,Yuan,J.,Ordonez,V.:Instance-levelimageretrievalusingrerankingtrans- formers. In: proceedings of the IEEE/CVF international conference on computer vision. pp. 12105–12115 (2021)

work page 2021
[34]

Particular object retrieval with integral max-pooling of CNN activations

Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max- pooling of cnn activations. arXiv preprint arXiv:1511.05879 (2015)

work page Pith review arXiv 2015
[35]

In: Proceedings of the ninth ACM international conference on Multimedia

Tong, S., Chang, E.: Support vector machine active learning for image retrieval. In: Proceedings of the ninth ACM international conference on Multimedia. pp. 107–118 (2001)

work page 2001
[36]

ACM computing surveys (csur)53(3), 1–34 (2020)

Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few examples: A survey on few-shot learning. ACM computing surveys (csur)53(3), 1–34 (2020)

work page 2020
[37]

In: Proceedings of the IEEE/CVF International conference on Computer Vision

Yang,M.,He,D.,Fan,M.,Shi,B.,Xue,X.,Li,F.,Ding,E.,Huang,J.:Dolg:Single- stage image retrieval with deep orthogonal fusion of local and global features. In: Proceedings of the IEEE/CVF International conference on Computer Vision. pp. 11772–11781 (2021)

work page 2021
[38]

In: CVPRW 2026 - The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) (2026)

Zaher, K., Buisson, O., Joly, A.: Positive-first most ambiguous: A simple active learning criterion for interactive retrieval of rare categories. In: CVPRW 2026 - The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) (2026)

work page 2026