Example-Based Object Detection

arxiv: 2605.04501 · v1 · submitted 2026-05-06 · 💻 cs.CV · cs.AI

ZhiXin Sun

Pith reviewed 2026-05-08 17:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords object detection · false positive suppression · open-vocabulary detection · feature matching · error examples · SAM · DINOv3 · LightGlue

The pith

EBOD suppresses repeated false positives and negatives in open-vocabulary object detection by matching prior error examples, without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the practical issue that object detectors like SAM3 still produce persistent false positives and false negatives on the same objects, even though retraining for each new error is expensive in time and resources. It proposes storing examples of those errors and using feature matching to recognize and filter matching instances in future images. The approach combines the prompt-based SAM3 detector with DINOv3 and LightGlue for robust matching of error instances. A reader would care because this offers an incremental way to make deployed detectors more reliable over time in real applications where the same mistakes keep recurring.

Core claim

The EBOD framework integrates a prompt-based detector such as SAM3 with DINOv3 and LightGlue feature matching so that previous false-positive and false-negative examples can be stored and used to suppress identical errors when they reappear in new images, achieving this without any model retraining.

What carries the argument

The EBOD pipeline that matches stored error examples against new-image features via DINOv3 and LightGlue to filter SAM3 detections.
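
A minimal sketch of the filtering idea, under stated assumptions: each detection carries a fixed-length embedding (a stand-in for a DINOv3-style feature), and a detection is dropped when its cosine similarity to any stored false-positive example reaches a threshold. The names `Detection` and `suppress_known_false_positives`, the 0.9 threshold, and the use of plain cosine similarity in place of LightGlue keypoint matching are all illustrative assumptions, not details taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple        # (x1, y1, x2, y2) in pixels
    score: float      # detector confidence
    embedding: list   # hypothetical stand-in for a DINOv3-style feature vector


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suppress_known_false_positives(detections, fp_bank, threshold=0.9):
    """Keep only detections that do not resemble any stored false positive.

    fp_bank is a list of embeddings of previously recorded false positives.
    The threshold is an assumed value and would need tuning on real data.
    """
    kept = []
    for det in detections:
        best = max((cosine(det.embedding, e) for e in fp_bank), default=0.0)
        if best < threshold:
            kept.append(det)
    return kept


# Toy example: the first detection closely resembles a stored false positive.
fp_bank = [[1.0, 0.0, 0.0]]
detections = [
    Detection((0, 0, 10, 10), 0.8, [0.99, 0.05, 0.0]),
    Detection((20, 20, 40, 40), 0.7, [0.0, 1.0, 0.0]),
]
kept = suppress_known_false_positives(detections, fp_bank)
```

With an empty bank, every detection is kept, so the filter only changes behaviour once an error has been recorded.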

Load-bearing premise

Feature matching between stored error examples and new images can reliably identify and suppress the exact same false positives or negatives.

What would settle it

A test image containing a previously recorded false positive that the system still outputs as a detection after matching the error example.
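
The check described above could be automated roughly as follows. This is a sketch that assumes the location of each known false positive in the test image is annotated as a box, and that "still outputs" means an output box overlaps it with IoU at or above 0.5. The function names and the IoU criterion are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recurring_errors(known_fp_boxes, output_boxes, iou_threshold=0.5):
    """Return the known false positives that the system still reports."""
    return [
        fp for fp in known_fp_boxes
        if any(iou(fp, out) >= iou_threshold for out in output_boxes)
    ]


# Toy example: one known false positive reappears, the other does not.
known_fp_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
output_boxes = [(1, 1, 11, 11), (100, 100, 120, 120)]
recurring = recurring_errors(known_fp_boxes, output_boxes)
```

A non-empty result on a held-out image would be a counterexample to the suppression claim for that error example.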

Figures

Figures reproduced from arXiv: 2605.04501 by ZhiXin Sun.

**Figure 1.** Overview of the proposed EBOD framework. Step 1: Use INSID3 to generate candidate

**Figure 2.** Images of the same object from two different viewpoints

**Figure 3.** Given a missed detection case, we visualize the detection results produced by the proposed

Original abstract

In recent years, object detection has achieved significant progress, especially in the field of open-vocabulary object detection. Unlike traditional methods that rely on predefined categories, open-vocabulary approaches can detect arbitrary objects based on human-provided prompts. With the advancement of prompt-based detection techniques, models such as SAM3 can even outperform some category-specific detectors trained on particular datasets without requiring additional training on those datasets. However, despite these advancements, false positives and false negatives still occur. In practical engineering applications, persistent misdetections or missed detections of the same object are unacceptable. Yet retraining the model every time such errors occur incurs substantial costs in terms of human effort, computational resources, and time. Therefore, how to leverage existing false positive and false negative samples to prevent such errors from recurring remains a highly challenging and urgent problem. To address this issue, we propose EBOD (Example-Based Object Detection), which integrates a prompt-based detector (SAM3) with robust feature matching modules (DINOv3 and LightGlue). The proposed framework effectively suppresses the repeated occurrence of false positives and false negatives by leveraging previous error examples, without requiring additional model retraining. Code is available at https://github.com/sunzx97/examples_based_object_detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper proposes an engineering integration of SAM3 with DINOv3 and LightGlue to suppress repeated FP/FN errors via stored examples without retraining, but supplies no experiments or results to show it works.

The punchline for this paper is that it offers an engineering approach to cut down on repeated mistakes in open-vocabulary detectors by matching stored error examples, but it doesn't show whether the approach actually works. What is new is the specific setup for using previous false positives and negatives to influence future detections in SAM3 via DINOv3 embeddings and LightGlue matching.

The paper does a good job explaining the motivation: in practical applications, retraining for every persistent error is too costly, so remembering and suppressing those cases is a reasonable idea. The description of the components is straightforward, and the availability of code is a plus for reproducibility.

The soft spots are significant, though. There are no experiments, metrics, or validation results at all. The claim that it effectively suppresses repeated errors is not supported by any data, so we can't tell if the feature matching reliably identifies the same instances under real-world variations like changes in viewpoint or lighting. The paper doesn't discuss potential failure modes, such as incorrect matches leading to missed detections or new false positives. Without those, the central assumption remains untested. The work is more of a proposal than a completed study.

This is the kind of thing that might interest someone building a production system who is willing to experiment with the code themselves. For academic readers or those wanting solid evidence, it falls short. The thinking is clear on the problem setup, but the lack of evidence means it doesn't hold up as a research contribution yet. I would not send this to peer review in its current state. It needs quantitative results and analysis before it would be worth a referee's time.

Referee Report

2 major / 1 minor

Summary. The paper proposes EBOD (Example-Based Object Detection), a framework that integrates the prompt-based open-vocabulary detector SAM3 with robust feature-matching modules DINOv3 and LightGlue. It claims to suppress repeated false positives and false negatives by leveraging prior error examples as references, without any model retraining or fine-tuning.

Significance. If the matching-based suppression mechanism proves reliable, the approach would offer a low-cost, training-free way to improve detection consistency in deployed systems where repeated errors on the same objects are costly. The availability of code is a positive factor for reproducibility.

major comments (2)

[Abstract] The central claim that the integration 'effectively suppresses the repeated occurrence of false positives and false negatives' is presented without any supporting experiments, quantitative metrics (e.g., reduction in FP/FN rate, matching precision/recall on error instances), ablation studies, or failure-mode analysis. No validation data or comparison against baselines appears.
[Abstract] The effectiveness hinges on the unstated details of how DINOv3+LightGlue matching identifies prior FP/FN instances and applies suppression (negative prompts, mask exclusion, or score adjustment). No similarity thresholds, handling of appearance variation (viewpoint, illumination, occlusion), or bounds on matching reliability are provided, leaving the load-bearing assumption unverified.
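
For concreteness, the metrics the referee asks for could be computed along these lines. This is a sketch only: the function names are hypothetical, and the counts in the example are arbitrary placeholders chosen to exercise the arithmetic, not results from the paper or from any experiment.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Matching precision and recall on error instances.

    true_pos:  stored errors correctly matched in new images
    false_pos: correct detections wrongly matched to a stored error
    false_neg: recurring errors the matcher failed to recognise
    """
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def relative_reduction(errors_before, errors_after):
    """Fractional drop in an error count after suppression is enabled."""
    if errors_before == 0:
        return 0.0
    return (errors_before - errors_after) / errors_before


# Placeholder counts, for illustration only.
precision, recall = precision_recall(true_pos=8, false_pos=2, false_neg=2)
fp_reduction = relative_reduction(errors_before=20, errors_after=5)
```

Reporting matching precision alongside the error reduction matters here, because a matcher with low precision would reduce false positives by also discarding correct detections.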

minor comments (1)

[Abstract] The GitHub link is provided but no description of the repository contents, example usage, or datasets used for any internal testing is given in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: [Abstract] The central claim that the integration 'effectively suppresses the repeated occurrence of false positives and false negatives' is presented without any supporting experiments, quantitative metrics (e.g., reduction in FP/FN rate, matching precision/recall on error instances), ablation studies, or failure-mode analysis. No validation data or comparison against baselines appears.

Authors: We agree that the abstract, in its current form, presents the central claim at a high level without quantitative support or references to validation. The manuscript body outlines the EBOD framework but does not yet contain the requested experiments, metrics, ablations, or baseline comparisons. We will revise the abstract to remove the unsubstantiated claim of effectiveness and instead describe the intended mechanism, while adding a new experimental section with quantitative results, failure-mode analysis, and comparisons in the revised manuscript.
Revision: yes

Referee: [Abstract] The effectiveness hinges on the unstated details of how DINOv3+LightGlue matching identifies prior FP/FN instances and applies suppression (negative prompts, mask exclusion, or score adjustment). No similarity thresholds, handling of appearance variation (viewpoint, illumination, occlusion), or bounds on matching reliability are provided, leaving the load-bearing assumption unverified.

Authors: We agree that the abstract omits these implementation details. The current manuscript text does not specify similarity thresholds, robustness to appearance changes, or reliability bounds. In the revision we will expand the abstract with a concise description of the matching pipeline (DINOv3 feature extraction followed by LightGlue matching to prior error examples, followed by score adjustment or mask exclusion) and add the missing parameters and analysis to the method section.
Revision: yes
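
The two suppression options named in this response, score adjustment and exclusion, differ in how recoverable a wrong match is. The sketch below illustrates that difference under assumptions: the matcher is taken to report a count of matched keypoints per detection (as a LightGlue-style matcher might), and `apply_error_matches`, `min_matches=20`, and `penalty=0.5` are hypothetical names and values, not parameters from the paper or its code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Detection:
    label: str
    score: float


def apply_error_matches(detections, match_counts, min_matches=20,
                        mode="exclude", penalty=0.5):
    """Act on detections that match a stored false-positive example.

    match_counts[i] is the assumed number of keypoint matches between
    detection i and its best-matching stored error example.
    mode="exclude" drops matched detections; mode="score" scales their
    confidence by `penalty` so a later score threshold can decide.
    """
    if mode not in ("exclude", "score"):
        raise ValueError(f"unknown mode: {mode}")
    result = []
    for det, count in zip(detections, match_counts):
        if count < min_matches:
            result.append(det)  # no convincing match: leave untouched
        elif mode == "score":
            result.append(replace(det, score=det.score * penalty))
        # mode == "exclude": the matched detection is dropped
    return result


# Toy example: only the first detection matches a stored error.
detections = [Detection("valve", 0.9), Detection("valve", 0.8)]
match_counts = [35, 3]
excluded = apply_error_matches(detections, match_counts, mode="exclude")
rescored = apply_error_matches(detections, match_counts, mode="score")
```

Score adjustment keeps the matched detection available to downstream logic, which limits the damage from an incorrect match; exclusion is simpler but turns every wrong match into a missed detection.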

Circularity Check

0 steps flagged

No circularity: engineering integration of existing components with no derivations or fitted predictions

The paper presents EBOD as a practical framework that combines the off-the-shelf prompt-based detector SAM3 with feature-matching modules DINOv3 and LightGlue. It claims this integration suppresses repeated false positives and negatives by using prior error examples, without retraining. No equations, mathematical derivations, parameter fitting, or self-citations appear in the abstract or described approach. The central claim is an empirical engineering assertion about effectiveness, not a derived result that reduces to its inputs by construction. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested assumption that feature matching will correctly map new images to prior error cases and that suppression logic will then improve detection.

axioms (1)

Domain assumption: DINOv3 and LightGlue feature matching can accurately identify and match previous false-positive and false-negative instances in new images.
Invoked as the mechanism that enables error suppression without retraining.

Reference graph

Works this paper leans on

14 extracted references · 6 canonical work pages · 1 internal anchor

[1] Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction, 2025. URL https://arxiv.org/abs/2510.12798

[2] Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-rex2: Towards generic object detection via text-visual prompt synergy, 2024

[3] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ... SAM 3: Segment Anything with Concepts. arXiv, 2025

[4] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks, 2024

[5] Yi-Jen Tsai, Yen-Yu Lin, and Chien-Yao Wang. Few-shot semantic segmentation meets sam3,

[6] URL https://arxiv.org/abs/2604.05433

[7] Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, and Stefan Roth. INSID3: Training-free in-context segmentation with DINOv3. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

[8] Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, and Chunhua Shen. Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:2305.13310, 2023

[9] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In ICCV, 2023

[10] Hanwen Jiang, Arjun Karpur, Bingyi Cao, Qixing Huang, and Andre Araujo. Omniglue: Generalizable feature matching with foundation model guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[11] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie... arXiv, 2025

[12] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press, 1996

[13] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description, 2018. URL https://arxiv.org/abs/1712.07629

[14] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016
