pith. machine review for the scientific record.

arxiv: 2604.23344 · v1 · submitted 2026-04-25 · 💻 cs.CV

Recognition: unknown

Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary detection · pseudo labeling · hierarchical calibration · CLIP adaptation · objectness scoring · vision language models · region proposal networks

The pith

Hierarchical consistency calibration and an objectness-adapted CLIP produce reliable pseudo labels for detecting novel object classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome two main problems in open-vocabulary object detection: VLMs give poor region-level class labels because they are trained on whole images, and RPNs give biased objectness scores because they see only base classes. To fix the first, it uses consistency of predictions across class, super-category, and sub-category levels to calibrate confidence in the labels. For the second, it adapts CLIP in a lightweight way by adding an objectness token that learns to score how likely a region is to contain an object without favoring base classes. If both components work, detectors can be trained on accurate pseudo labels covering many more object types, expanding what they can recognize without new labeled data.
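
A minimal sketch of the hierarchy-consistency idea, assuming a toy class tree and per-level region-text scores from a VLM; the agreement rule and all names here are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: calibrate a region's label confidence by checking whether
# predictions agree across class, super-category, and sub-category levels.
import numpy as np

# Toy hierarchy (assumption): class -> parent super-category and child subs.
PARENT = {"cat": "animal", "dog": "animal", "car": "vehicle"}
CHILDREN = {"cat": ["tabby", "siamese"], "dog": ["beagle"], "car": ["sedan"]}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def calibrated_label(class_logits, super_logits, sub_logits,
                     classes, supers, subs):
    """Return (label, calibrated confidence) for one candidate region."""
    p_cls = softmax(class_logits)
    label = classes[int(p_cls.argmax())]
    # Agreement 1: the top super-category should be the label's parent.
    agree_super = supers[int(softmax(super_logits).argmax())] == PARENT[label]
    # Agreement 2: the top sub-category should be a child of the label.
    agree_sub = subs[int(softmax(sub_logits).argmax())] in CHILDREN[label]
    # Down-weight confidence when the levels disagree.
    consistency = (1.0 + agree_super + agree_sub) / 3.0
    return label, float(p_cls.max()) * consistency

label, conf = calibrated_label(
    np.array([2.5, 0.3, 0.1]),       # class-level scores for a region
    np.array([1.8, 0.2]),            # super-category scores
    np.array([2.1, 0.4, 0.3, 0.2]),  # sub-category scores
    ["cat", "dog", "car"], ["animal", "vehicle"],
    ["tabby", "siamese", "beagle", "sedan"])
print(label, round(conf, 3))  # consistent levels keep confidence high
```

A region whose class pick clashes with the hierarchy keeps its label but at reduced confidence, giving the pseudo-labeling stage a threshold to filter on.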

Core claim

By measuring consistency of class predictions at multiple levels of a semantic hierarchy and by inserting a dedicated objectness token into a parameter-efficient version of CLIP, the framework generates trustworthy pseudo labels that include both accurate categories and unbiased objectness scores for classes never seen in the detector's training data.

What carries the argument

Hierarchical Confidence Calibration (HCC), which checks agreement across semantic levels to gauge label reliability, together with LoCLIP, a parameter-efficient CLIP adaptation that adds an objectness token to produce unbiased region scores.
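
For intuition about the token mechanism, here is a minimal PyTorch sketch: a single learnable token is appended to a frozen encoder's input and read out as an objectness score, so only the token and a small head train. The encoder is a stand-in nn.TransformerEncoder rather than CLIP itself, and every name below is an assumption, not the paper's implementation.

```python
# Hedged sketch of an objectness-token adapter over a frozen backbone.
import torch
import torch.nn as nn

class ObjectnessAdapter(nn.Module):
    def __init__(self, dim=512, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        for p in self.encoder.parameters():   # freeze the backbone
            p.requires_grad = False
        self.obj_token = nn.Parameter(torch.zeros(1, 1, dim))  # trainable
        self.score_head = nn.Linear(dim, 1)                    # trainable

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) features for cropped region proposals.
        tok = self.obj_token.expand(patch_tokens.size(0), -1, -1)
        x = torch.cat([tok, patch_tokens], dim=1)
        x = self.encoder(x)
        # Read the objectness score off the added token's output position.
        return torch.sigmoid(self.score_head(x[:, 0])).squeeze(-1)

model = ObjectnessAdapter()
regions = torch.randn(4, 49, 512)  # 4 proposals, 7x7 patch grid
print(model(regions).shape)        # torch.Size([4]), objectness in (0, 1)
```

Training only the token and head keeps the adaptation parameter-efficient, in the spirit of the paper's description.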

If this is right

  • Detectors trained using these pseudo labels reach higher mean average precision on novel classes in standard benchmarks.
  • The approach mitigates the base-class bias that previously limited region proposal quality for unseen objects.
  • Consistency across hierarchy levels serves as a proxy for correctness without needing ground-truth region labels.
  • LoCLIP remains efficient to train while improving generalization of objectness estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar consistency ideas might help in other pseudo-labeling scenarios such as semi-supervised learning where label noise is an issue.
  • The objectness token could be tested in fully zero-shot settings to see if base-class training can be avoided entirely.
  • Applying this to video or 3D data might require extending the hierarchy to temporal or spatial relations.

Load-bearing premise

That predictions agreeing across different levels of the class hierarchy are more likely to be correct at the region level, and that the added token in the CLIP adaptation does not create new biases specific to the novel classes.

What would settle it

An ablation settles it: if removing the hierarchical consistency step or the objectness token drops novel-class detection back to the levels of prior methods, those components carry the gains; if performance holds up without them, the claimed contribution is falsified.
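
One way to make that test concrete is the decision rule below, a sketch assuming hypothetical mAP inputs from real ablation runs; the numbers in the demo call are made up purely to exercise the logic and are not results from the paper.

```python
# Hedged sketch of the settle-it ablation decision rule. novel_map would be
# filled with measured novel-class mAP from real runs of each configuration.

def settles_it(novel_map: dict, prior_best: float) -> str:
    """novel_map maps (use_hcc, use_objectness_token) -> novel-class mAP."""
    full = novel_map[(True, True)]
    ablated = min(novel_map[(False, True)], novel_map[(True, False)])
    if full <= prior_best:
        return "full model fails to beat prior methods: claim falsified"
    if ablated <= prior_best:
        return "removing a component falls back to prior levels: it carries the gain"
    return "gains survive component removal: something else carries the claim"

# Demo with made-up numbers (not the paper's):
print(settles_it({(True, True): 40.0, (False, True): 29.0, (True, False): 31.0},
                 prior_best=30.0))
```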

Figures

Figures reproduced from arXiv: 2604.23344 by Bumsub Ham, Geon Lee, Hyekang Park, Sanghoon Lee.

Figure 1: Visualization of hierarchical consistency of candidate …
Figure 2: Overview of our framework for OVD, which mainly consists of three steps. First, a set of candidate regions is extracted from …
Figure 3: An illustration of the operations in the HCC technique …
Figure 4: Illustration of LoCLIP. LoCLIP appends a learnable …
Original abstract

Conventional object detectors typically operate under a closed-set assumption, limiting recognition to a predefined set of base classes seen during training. Open-vocabulary object detection (OVD) addresses this limitation by leveraging vision-language models (VLMs) to generate pseudo labels for novel object classes. However, existing OVD methods suffer from two critical drawbacks: (1) inaccurate class label assignments, as VLMs are optimized for image-level predictions rather than the region-level predictions required for pseudo labeling, and (2) unreliable objectness scores from region proposal networks (RPNs) trained exclusively on base object classes. To address these issues, we propose a novel pseudo labeling framework for OVD. Our approach introduces a hierarchical confidence calibration (HCC) technique, which ensures reliable class label estimation by assessing consistency across hierarchical semantic levels (class, super- and sub-category). We also present LoCLIP, a parameter-efficient adaptation of CLIP that incorporates an objectness token to mitigate base class bias problem of RPNs and provide reliable objectness estimations for novel object classes. Extensive experiments on standard OVD benchmarks, including COCO and LVIS, demonstrate that our approach clearly sets a new state of the art, validating the effectiveness of our approach. Project site: https://cvlab.yonsei.ac.kr/projects/HCC

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a pseudo-labeling framework for open-vocabulary object detection (OVD) to overcome inaccurate region-level class labels from VLMs and biased objectness scores from base-class-trained RPNs. It introduces Hierarchical Confidence Calibration (HCC), which calibrates labels by measuring prediction consistency across class, super-category, and sub-category levels, and LoCLIP, a parameter-efficient CLIP adaptation that adds an objectness token for unbiased novel-class objectness estimation. The work claims new state-of-the-art results on standard OVD benchmarks including COCO and LVIS.

Significance. If the empirical claims and the validity of the core HCC assumption hold, this work would represent a meaningful advance in OVD by supplying a more principled mechanism for generating reliable pseudo-labels for novel classes. The hierarchical consistency idea and the lightweight LoCLIP adaptation with an explicit objectness token are technically interesting and could influence subsequent VLM-based detection pipelines. The parameter-efficient nature of LoCLIP is a particular strength that aligns with practical deployment constraints.

major comments (2)
  1. [HCC subsection] HCC description (method section): The central claim that consistency of predictions across class/super/sub-category levels reliably indicates correct region-level labels for novel classes is load-bearing for the entire framework. The manuscript must include direct validation—e.g., plots or tables showing that higher consistency scores correlate with ground-truth accuracy on held-out novel classes—rather than relying solely on the consistency metric itself. Without this, correlated VLM biases across hierarchy levels could produce consistent-yet-incorrect pseudo-labels, as flagged by the stress-test concern. (A sketch of such a consistency-vs-accuracy check follows this list.)
  2. [LoCLIP subsection] LoCLIP adaptation section: The claim that the added objectness token mitigates base-class bias without introducing new biases or overfitting requires explicit ablation evidence (e.g., objectness score distributions and detection AP breakdowns on novel vs. base classes before/after adaptation). The current high-level description does not address whether the token is trained on base classes only or how its parameters are initialized, leaving the unbiasedness claim unanchored.
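
As referenced in major comment 1, a minimal sketch of the requested check: bin pseudo-labels by their consistency score and report accuracy per bin against held-out ground truth. The toy inputs are generated only so the example runs end to end; the binning protocol and variable names are assumptions, not the paper's.

```python
# Hedged sketch: does pseudo-label accuracy rise with HCC consistency?
import numpy as np

def reliability_bins(consistency, correct, n_bins=5):
    """Mean pseudo-label accuracy per consistency bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (consistency >= lo) & (consistency < hi)
        if mask.any():
            rows.append((lo, hi, correct[mask].mean(), int(mask.sum())))
    return rows

rng = np.random.default_rng(0)
consistency = rng.uniform(size=1000)            # toy HCC scores per region
correct = rng.uniform(size=1000) < consistency  # toy: accuracy tracks score
for lo, hi, acc, n in reliability_bins(consistency, correct.astype(float)):
    print(f"[{lo:.1f}, {hi:.1f}): acc={acc:.2f} (n={n})")
# The HCC premise holds only if per-bin accuracy rises with consistency.
```
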
minor comments (2)
  1. [Abstract] Abstract: The sentence 'mitigate base class bias problem of RPNs' is grammatically incomplete; it should read 'mitigate the base-class bias problem of RPNs'.
  2. [Discussion] The manuscript should add a dedicated limitations or failure-case subsection, given that the reader's report notes the absence of error analysis in the abstract-level description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive referee report. We appreciate the focus on validating the core assumptions behind HCC and LoCLIP. We address each major comment below and will revise the manuscript to incorporate the requested empirical evidence.

Point-by-point responses
  1. Referee: [HCC subsection] HCC description (method section): The central claim that consistency of predictions across class/super/sub-category levels reliably indicates correct region-level labels for novel classes is load-bearing for the entire framework. The manuscript must include direct validation—e.g., plots or tables showing that higher consistency scores correlate with ground-truth accuracy on held-out novel classes—rather than relying solely on the consistency metric itself. Without this, correlated VLM biases across hierarchy levels could produce consistent-yet-incorrect pseudo-labels, as flagged by the stress-test concern.

    Authors: We agree that direct validation of the HCC assumption is necessary. While the manuscript demonstrates end-to-end gains on COCO and LVIS, it does not currently include explicit correlation analysis between consistency scores and ground-truth accuracy on held-out novel classes. In the revision we will add plots and tables that quantify this correlation (e.g., accuracy vs. consistency bins on novel-class regions), thereby addressing the possibility of correlated VLM biases across hierarchy levels. revision: yes

  2. Referee: [LoCLIP subsection] LoCLIP adaptation section: The claim that the added objectness token mitigates base-class bias without introducing new biases or overfitting requires explicit ablation evidence (e.g., objectness score distributions and detection AP breakdowns on novel vs. base classes before/after adaptation). The current high-level description does not address whether the token is trained on base classes only or how its parameters are initialized, leaving the unbiasedness claim unanchored.

    Authors: We acknowledge that the current description of LoCLIP is high-level and lacks the requested ablations. In the revised manuscript we will add: (i) objectness score distributions before and after adaptation, (ii) novel-class vs. base-class AP breakdowns, and (iii) explicit details on training (base classes only) and initialization of the objectness token. These additions will substantiate that the adaptation reduces base-class bias without introducing new biases or overfitting. revision: yes
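
A small sketch of the diagnostic promised in response 2, summarizing base-vs-novel objectness distributions by their mean gap; the statistic and the beta-distributed toy scores are assumptions chosen only so the example runs.

```python
# Hedged sketch of the base-vs-novel bias diagnostic. Real inputs would be
# per-region objectness scores from the RPN or LoCLIP on each class split.
import numpy as np

def bias_gap(base_scores, novel_scores):
    """Mean objectness on base classes minus mean on novel classes.
    A positive gap means scores still favor base classes."""
    return float(np.mean(base_scores) - np.mean(novel_scores))

rng = np.random.default_rng(1)
raw_rpn = bias_gap(rng.beta(5, 2, 500), rng.beta(2, 5, 500))  # toy: biased
adapted = bias_gap(rng.beta(5, 2, 500), rng.beta(4, 2, 500))  # toy: calibrated
print(f"gap before adaptation: {raw_rpn:.2f}, after: {adapted:.2f}")
```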

Circularity Check

0 steps flagged

No circularity: new calibration and adaptation steps are independent contributions

Full rationale

The paper introduces HCC for cross-level consistency in pseudo-labeling and LoCLIP for unbiased objectness via an added token; these are presented as novel techniques whose effectiveness is shown via benchmark experiments rather than by reducing any claimed prediction or result to a prior fit, self-definition, or self-citation chain. No equations or steps in the provided abstract or description equate outputs to inputs by construction, and the central claims rest on empirical validation instead of imported uniqueness theorems or renamed known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about the utility of vision-language models for pseudo-labeling and the correlation between hierarchical consistency and label accuracy, plus the new LoCLIP entity whose effectiveness is not independently verified outside the paper.

axioms (2)
  • domain assumption: Vision-language models can generate useful pseudo labels for novel object classes at the region level.
    Core premise of the OVD pseudo-labeling framework stated in the abstract.
  • ad hoc to paper: Consistency of predictions across class, super-category, and sub-category levels indicates reliable region-level class assignments.
    Direct basis for the hierarchical confidence calibration (HCC) method.
invented entities (1)
  • LoCLIP: no independent evidence
    purpose: Parameter-efficient CLIP adaptation incorporating an objectness token to reduce base-class bias in region proposal networks
    New model variant introduced to address unreliable objectness scores for novel classes.

pith-pipeline@v0.9.0 · 5541 in / 1503 out tokens · 85114 ms · 2026-05-08T08:35:45.196618+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 6 canonical work pages · 5 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.

  2. [2]

    Bridging the gap between object and image-level representations for open-vocabulary detection

    Hanoona Bangalath, Muhammad Maaz, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shahbaz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. In NeurIPS, 2022.

  3. [3]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

  4. [4]

    Open vocabulary object detection with pseudo bounding-box labels

    Mingfei Gao, Chen Xing, Juan Carlos Niebles, Junnan Li, Ran Xu, Wenhao Liu, and Caiming Xiong. Open vocabulary object detection with pseudo bounding-box labels. In ECCV,

  5. [5]

    Simple copy-paste is a strong data augmentation method for instance segmentation

    Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021.

  6. [6]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  7. [7]

    Open-vocabulary object detection via vision and language knowledge distillation

    Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.

  8. [8]

    LVIS: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR,

  9. [9]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

  10. [10]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.

  11. [11]

    Open-vocabulary object detection via language hierarchy

    Jiaxing Huang, Jingyi Zhang, Kai Jiang, and Shijian Lu. Open-vocabulary object detection via language hierarchy. In NeurIPS, 2024.

  12. [12]

    LLMs meet VLMs: Boost open vocabulary object detection with fine-grained descriptors

    Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, and Shijian Lu. LLMs meet VLMs: Boost open vocabulary object detection with fine-grained descriptors. In ICLR, 2024.

  13. [13]

    Retrieval-augmented open-vocabulary object detection

    Jooyeon Kim, Eulrang Cho, Sehyung Kim, and Hyunwoo J Kim. Retrieval-augmented open-vocabulary object detection. In CVPR, 2024.

  14. [14]

    F-VLM: Open-vocabulary object detection upon frozen vision and language models

    Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. F-VLM: Open-vocabulary object detection upon frozen vision and language models. In ICLR, 2023.

  15. [15]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021.

  16. [16]

    Distilling detr with visual-linguistic knowledge for open-vocabulary object detection

    Liangqi Li, Jiaxu Miao, Dahu Shi, Wenming Tan, Ye Ren, Yi Yang, and Shiliang Pu. Distilling detr with visual-linguistic knowledge for open-vocabulary object detection. In ICCV,

  17. [17]

    CLIFF: Continual latent diffusion for open-vocabulary object detection

    Wuyang Li, Xinyu Liu, Jiayi Ma, and Yixuan Yuan. CLIFF: Continual latent diffusion for open-vocabulary object detection. In ECCV, 2024.

  18. [18]

    Learning object-language alignments for open-vocabulary object detection

    Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, and Jianfei Cai. Learning object-language alignments for open-vocabulary object detection. In ICLR, 2023.

  19. [19]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

  20. [20]

    SHiNe: Semantic hierarchy nexus for open-vocabulary object detection

    Mingxuan Liu, Tyler L Hayes, Elisa Ricci, Gabriela Csurka, and Riccardo Volpi. SHiNe: Semantic hierarchy nexus for open-vocabulary object detection. In CVPR, 2024.

  21. [21]

    CoDet: Co-occurrence guided region-word alignment for open-vocabulary object detection

    Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, and Xiaojuan Qi. CoDet: Co-occurrence guided region-word alignment for open-vocabulary object detection. In NeurIPS, 2023.

  22. [22]

    Class-agnostic object detection with multi-modal transformer

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Ming-Hsuan Yang. Class-agnostic object detection with multi-modal transformer. In ECCV, 2022.

  23. [23]

    WordNet: a lexical database for english

    George A Miller. WordNet: a lexical database for English. ACM, 1995.

  24. [24]

    CHiLS: Zero-shot image classification with hierarchical label sets

    Zachary Novack, Julian McAuley, Zachary Chase Lipton, and Saurabh Garg. CHiLS: Zero-shot image classification with hierarchical label sets. In ICML, 2023.

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  26. [26]

    Faster R-CNN: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 2016.

  27. [27]

    ImageNet-21K pretraining for the masses

    Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. ImageNet-21K pretraining for the masses. In NeurIPS,

  28. [28]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.

  29. [29]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.

  30. [30]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  31. [31]

    Marvelovd: Marrying object recognition and vision-language models for robust open-vocabulary object detection

    Kuo Wang, Lechao Cheng, Weikai Chen, Pingping Zhang, Liang Lin, Fan Zhou, and Guanbin Li. Marvelovd: Marrying object recognition and vision-language models for robust open-vocabulary object detection. In ECCV, 2024.

  32. [32]

    Object-aware distillation pyramid for open-vocabulary object detection

    Luting Wang, Yi Liu, Penghui Du, Zihan Ding, Yue Liao, Qiaosong Qi, Biaolong Chen, and Si Liu. Object-aware distillation pyramid for open-vocabulary object detection. In CVPR, 2023.

  33. [33]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In ICLR, 2023.

  34. [34]

    Aligning bag of regions for open-vocabulary object detection

    Size Wu, Wenwei Zhang, Sheng Jin, Wentao Liu, and Chen Change Loy. Aligning bag of regions for open-vocabulary object detection. In CVPR, 2023.

  35. [35]

    ClipSelf: Vision transformer distills itself for open-vocabulary dense prediction

    Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. ClipSelf: Vision transformer distills itself for open-vocabulary dense prediction. In ICLR, 2024.

  36. [36]

    Open-vocabulary DETR with conditional matching

    Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary DETR with conditional matching. In ECCV, 2022.

  37. [37]

    Exploiting unlabeled data with vision and language models for object detection

    Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, BG Vijay Kumar, Anastasis Stathopoulos, Manmohan Chandraker, and Dimitris N Metaxas. Exploiting unlabeled data with vision and language models for object detection. In ECCV, 2022.

  38. [38]

    Taming self-training for open-vocabulary object detection

    Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Yumin Suh, Manmohan Chandraker, Dimitris N Metaxas, et al. Taming self-training for open-vocabulary object detection. In CVPR, 2024.

  39. [39]

    RegionCLIP: Region-based language-image pretraining

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. RegionCLIP: Region-based language-image pretraining. In CVPR, 2022.

  40. [40]

    Probabilistic two-stage detection

    Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461, 2021.

  41. [41]

    Detecting twenty-thousand classes using image-level supervision

    Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.

  42. [42]

    Deformable DETR: Deformable transformers for end-to-end object detection

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.