pith. machine review for the scientific record. sign in

arxiv: 2604.00503 · v2 · submitted 2026-04-01 · 💻 cs.CV

Recognition: no theorem link

PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

Authors on Pith no claims yet

Pith reviewed 2026-05-13 22:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-set object detectionvisual promptszero-shot detectionprompt-enriched traininguniversal detectorgrounding dino
0
0 comments X

The pith

PET-DINO adds an alignment-friendly visual prompt module to Grounding DINO and trains it with parallel and memory-driven strategies to support both text and visual inputs for open-set detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build a single detector that recognizes novel object categories using either text descriptions or visual examples as prompts. It starts from an existing text-prompted model and inserts a generation step that turns visual cues into representations that align directly with text ones. Two training routines run prompts in parallel within batches and draw from a dynamic memory store across the full training process, allowing the model to learn multiple input formats at once. This approach matters because it avoids the heavy multi-modal engineering and staged optimization common in other visual-prompt methods while still reaching competitive accuracy on zero-shot tasks. If the claim holds, detectors become simpler to develop and more flexible for real-world cases where text alone falls short on rare or complex objects.

Core claim

PET-DINO inherits the base architecture from Grounding DINO and augments it with an Alignment-Friendly Visual Prompt Generation module that produces visual prompts compatible with text guidance, plus Intra-Batch Parallel Prompting at each training step and Dynamic Memory-Driven Prompting across the full schedule; together these let one model handle text and visual prompt routes simultaneously and deliver competitive zero-shot detection results on varied protocols.

What carries the argument

The Alignment-Friendly Visual Prompt Generation module, which converts raw visual examples into prompt embeddings that match the text representation space of the base detector, supported by the two prompt-enriched training strategies that enable simultaneous learning of multiple prompt types.

If this is right

  • A detector trained this way can switch between text and visual prompts at inference time without separate fine-tuning.
  • Training data usage improves because each batch simultaneously exercises text-only, visual-only, and mixed prompt paths.
  • Development time shortens since the method builds directly on an existing detector rather than redesigning the entire multi-modal pipeline.
  • Performance on specialized domains rises by using visual prompts to capture details that text descriptions miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt-enrichment idea could extend to video or 3D inputs by treating motion or depth clips as additional prompt sources.
  • Memory-driven prompting hints at a path toward continual adaptation where new visual examples update the detector on the fly.
  • If visual prompts prove especially strong for rare classes, future benchmarks might shift emphasis from text-only to mixed-prompt evaluation protocols.

Load-bearing premise

That the new visual prompt generation step plus the batch-parallel and memory-driven training routines can overcome the known limits of pure text representations without requiring the complex multi-modal fusion designs used in prior work.

What would settle it

Run PET-DINO and a multi-stage visual-prompt baseline on a held-out test set containing rare categories or intricate objects and check whether PET-DINO's zero-shot average precision remains within a few points of the baseline.

Figures

Figures reproduced from arXiv: 2604.00503 by Bin-Bin Gao, Chengjie Wang, Hanqiu Deng, Jialin Li, Jinyang Li, Weifu Fu, Wenbing Tao, Yong Liu, Yuhuan Lin.

Figure 1
Figure 1. Figure 1: Overall architecture of PET-DINO. Input coordinates undergo a Visual Prompt Generation process, interacting with enhanced [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Intra-Batch Parallel Prompting Diagram. Image and [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison between training from scratch and inherit [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: Feature correlation analysis between visual prompts and [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 1
Figure 1. Figure 1: Dynamic Memory-Driven Prompting Diagram. During [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: t-SNE visualization of visual prompt features showing [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Zero-shot detection visualizations of PET-DINO on interactive visual prompt-based detection in single-category dense object scenarios [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Zero-shot detection visualizations of PET-DINO on interactive visual prompt-based detection in multi-category dense object scenarios. 3 [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Zero-shot detection visualizations of PET-DINO on cross-image exemplar visual prompt-based detection. Exemplars are shown above, and prediction outputs are shown below [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Zero-shot detection visualizations of PET-DINO with class-level generic visual prompts. The visual prompt embeddings are pre-extracted from the training set. 4 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Zero-shot detection visualizations of PET-DINO with text prompts. The category names from the dataset are utilized as textual inputs for prompt generation. 5 [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
read the original abstract

Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text pairs for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, facilitating parallel alignment with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO exhibits competitive zero-shot object detection capabilities across various prompt-based detection protocols. These strengths can be attributed to inheritance-based philosophy and prompt-enriched training strategies, which play a critical role in building an effective generic object detector. Project page: https://fuweifuvtoo.github.io/pet-dino.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PET-DINO, a universal open-set object detector extending Grounding DINO to support both text and visual prompts. It introduces an Alignment-Friendly Visual Prompt Generation (AFVPG) module to mitigate text-representation limitations and two prompt-enriched training strategies—Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the training level—to enable parallel modeling of multiple prompt routes. The central claim is that this inheritance-based design plus the new strategies yields competitive zero-shot detection performance across prompt-based protocols while avoiding the complex multi-modal architectures and multi-stage optimizations of prior work.

Significance. If the performance claims and attribution to the proposed modules hold after proper controls, the work could offer a simpler, more efficient path for prompt-based open-set detection by reusing strong text-prompted backbones and adding targeted alignment mechanisms. The focus on training strategies for data-driven OSOD addresses an underexplored aspect and may shorten development cycles for generic detectors.

major comments (2)
  1. [§4 Experiments and Table 2] §4 Experiments and Table 2: the central claim that AFVPG + IBP + DMD produce competitive zero-shot gains rests on unisolated improvements; no ablation row or control experiment trains unmodified Grounding DINO under identical data volume, epochs, and schedule, so attribution to the prompt-enriched strategies versus continued backbone training cannot be verified.
  2. [§3.2 AFVPG module] §3.2 AFVPG module: the assertion that the module 'reduces the development cycle' by addressing text limitations lacks a direct complexity comparison (parameter count, FLOPs, or stage count) against the multi-modal baselines criticized in the introduction; without this, the simplicity advantage remains unquantified.
minor comments (2)
  1. [Abstract] Abstract: states 'comprehensive experiments demonstrate competitive zero-shot capabilities' yet supplies no numerical metrics, baselines, or dataset names; adding one or two key numbers would improve clarity.
  2. [§3 Method] Notation: the acronyms AFVPG, IBP, and DMD are introduced without an initial glossary or consistent expansion on first use in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications and committing to revisions that strengthen the experimental validation and complexity analysis without misrepresenting our contributions.

read point-by-point responses
  1. Referee: [§4 Experiments and Table 2] §4 Experiments and Table 2: the central claim that AFVPG + IBP + DMD produce competitive zero-shot gains rests on unisolated improvements; no ablation row or control experiment trains unmodified Grounding DINO under identical data volume, epochs, and schedule, so attribution to the prompt-enriched strategies versus continued backbone training cannot be verified.

    Authors: We acknowledge the validity of this concern: without a control that trains unmodified Grounding DINO under identical data volume, epochs, and schedule, it is difficult to fully isolate the contributions of AFVPG, IBP, and DMD from the effects of additional training. In the revised manuscript, we will add this control experiment as a new ablation row in Table 2 (and corresponding discussion in §4). This will train the original Grounding DINO backbone using the same prompt-enriched data and training schedule as PET-DINO, allowing direct attribution of gains to our inheritance-based design and prompt-enriched strategies. We believe this addition will confirm that the observed zero-shot improvements arise from the proposed modules rather than continued backbone optimization alone. revision: yes

  2. Referee: [§3.2 AFVPG module] §3.2 AFVPG module: the assertion that the module 'reduces the development cycle' by addressing text limitations lacks a direct complexity comparison (parameter count, FLOPs, or stage count) against the multi-modal baselines criticized in the introduction; without this, the simplicity advantage remains unquantified.

    Authors: We agree that a quantitative complexity comparison would better substantiate the claim of reduced development cycle. In the revision, we will insert a new table (in §3.2 or §4) reporting parameter counts, FLOPs, and number of training stages for PET-DINO versus the multi-modal baselines referenced in the introduction. This will explicitly show that AFVPG adds only lightweight components to the existing Grounding DINO backbone, avoiding the complex multi-modal architectures and multi-stage optimizations of prior work, thereby supporting the efficiency advantage of our inheritance-based approach. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims rest on proposed modules and external experiments, not self-referential derivations

full rationale

The paper proposes three new modules (AFVPG, IBP, DMD) built on an inherited Grounding DINO backbone and attributes competitive zero-shot detection to these additions plus prompt-enriched training. No equations, parameter fits, or uniqueness theorems appear in the provided text that reduce any claimed result to quantities defined by the authors' own inputs. The central argument is empirical, relying on comprehensive experiments across prompt protocols rather than any derivation chain that collapses by construction. Self-citations, if present in the full manuscript, are not load-bearing for the core claim, which remains independently testable via ablation and comparison to unmodified baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so concrete free parameters, axioms, and invented entities cannot be extracted. The work introduces new named modules and training procedures whose internal hyperparameters and assumptions remain unspecified.

pith-pipeline@v0.9.0 · 5577 in / 1111 out tokens · 41951 ms · 2026-05-13T22:16:16.497773+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 4 internal anchors

  1. [1]

    Cp-detr: Concept prompt guide detr to- ward stronger universal object detection

    Qibo Chen, Weizhong Jin, Jianyue Ge, Mengdi Liu, Yuchao Yan, Jian Jiang, Li Yu, Xuanjiang Guo, Shuchang Li, and Jianzhong Chen. Cp-detr: Concept prompt guide detr to- ward stronger universal object detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 2141– 2149, 2025. 1, 3, 5, 6

  2. [2]

    Delving into the trajectory long-tail distribution for muti-object track- ing

    Sijia Chen, En Yu, Jinyang Li, and Wenbing Tao. Delving into the trajectory long-tail distribution for muti-object track- ing. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 19341–19351,

  3. [3]

    Cross-view referring multi-object tracking

    Sijia Chen, En Yu, and Wenbing Tao. Cross-view referring multi-object tracking. InProceedings of the AAAI Confer- ence on Artificial Intelligence, pages 2204–2211, 2025

  4. [4]

    ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

    Sijia Chen, Yanqiu Yu, En Yu, and Wenbing Tao. Reamot: A benchmark and framework for reasoning-based multi-object tracking.arXiv preprint arXiv:2505.20381, 2025. 2

  5. [5]

    Bert: Pre-training of deep bidirectional trans- formers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InProceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers), pages 4171– 4186, 2019. 5

  6. [6]

    WeiFu Fu, CongChong Nie, Ting Sun, Jun Liu, TianLiang Zhang, and Yong Liu. Lvis challenge track technical report 1st place solution: Distribution balanced and boundary re- finement for large vocabulary instance segmentation.arXiv preprint arXiv:2111.02668, 2021. 1

  7. [7]

    Open-vocabulary object detection via vision and language knowledge distillation,

    Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation.arXiv preprint arXiv:2104.13921,

  8. [8]

    Lvis: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019. 2, 5

  9. [9]

    T-rex: Counting by visual prompting.arXiv preprint arXiv:2311.13596, 2023

    Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, and Lei Zhang. T-rex: Counting by visual prompting.arXiv preprint arXiv:2311.13596, 2023. 3

  10. [10]

    T-rex2: Towards generic object detec- tion via text-visual prompt synergy

    Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-rex2: Towards generic object detec- tion via text-visual prompt synergy. InEuropean Conference on Computer Vision, pages 38–57. Springer, 2024. 1, 3, 4, 5, 6

  11. [11]

    Mdetr- modulated detection for end-to-end multi-modal understand- ing

    Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr- modulated detection for end-to-end multi-modal understand- ing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 1780–1790, 2021. 5

  12. [12]

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Ui- jlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.Interna- tional journal of computer vision, 128(7):1956–1981, 2020. 5

  13. [13]

    Elevater: A benchmark and toolkit for evaluating language-augmented visual models.Advances in Neural Information Processing Systems, 35:9287–9301,

    Chunyuan Li, Haotian Liu, Liunian Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, et al. Elevater: A benchmark and toolkit for evaluating language-augmented visual models.Advances in Neural Information Processing Systems, 35:9287–9301,

  14. [14]

    Visual in-context prompting

    Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Jianwei Yang, Chunyuan Li, et al. Visual in-context prompting. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12861–12871, 2024. 3, 7

  15. [15]

    Lors: Low-rank residual structure for parameter-efficient network stacking

    Jialin Li, Qiang Nie, Weifu Fu, Yuhuan Lin, Guangpin Tao, Yong Liu, and Chengjie Wang. Lors: Low-rank residual structure for parameter-efficient network stacking. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15866–15876, 2024. 1

  16. [16]

    Ovtr: End- to-end open-vocabulary multiple object tracking with trans- former.arXiv preprint arXiv:2503.10616, 2025

    Jinyang Li, En Yu, Sijia Chen, and Wenbing Tao. Ovtr: End- to-end open-vocabulary multiple object tracking with trans- former.arXiv preprint arXiv:2503.10616, 2025. 2

  17. [17]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jian- wei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10965–10975, 2022. 1, 2

  18. [18]

    Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection.Advances in neural information processing systems, 33:21002–21012, 2020

    Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection.Advances in neural information processing systems, 33:21002–21012, 2020. 3

  19. [19]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 2, 5

  20. [20]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuro- pean conference on computer vision, pages 38–55. Springer,

  21. [21]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 5

  22. [22]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 5

  23. [23]

    Un- derstanding fine-tuning clip for open-vocabulary semantic segmentation in hyperbolic space

    Zelin Peng, Zhengqin Xu, Zhilin Zeng, Changsong Wen, Yu Huang, Menglin Yang, Feilong Tang, and Wei Shen. Un- derstanding fine-tuning clip for open-vocabulary semantic segmentation in hyperbolic space. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 4562–4572, 2025. 2

  24. [24]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, 9 Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 6

  25. [25]

    Generalized in- tersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized in- tersection over union: A metric and a loss for bounding box regression. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666,

  26. [26]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 8430–8439, 2019. 5

  27. [27]

    Yoloe: Real-time seeing anything

    Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yoloe: Real-time seeing anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24591–24602, 2025. 1, 3, 6

  28. [28]

    V3det: Vast vocabulary visual detection dataset

    Jiaqi Wang, Pan Zhang, Tao Chu, Yuhang Cao, Yujie Zhou, Tong Wu, Bin Wang, Conghui He, and Dahua Lin. V3det: Vast vocabulary visual detection dataset. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 19844–19854, 2023. 5

  29. [29]

    General object foundation model for images and videos at scale

    Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, and Song Bai. General object foundation model for images and videos at scale. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 3783–3795, 2024. 2

  30. [30]

    Multi-modal queried object detection in the wild.Advances in Neural Information Processing Systems, 36:4452–4469, 2023

    Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xi- aoshan Yang, Ke Li, and Changsheng Xu. Multi-modal queried object detection in the wild.Advances in Neural Information Processing Systems, 36:4452–4469, 2023. 3

  31. [31]

    Detclip: Dictionary-enriched visual-concept paralleled pre- training for open-world detection.Advances in Neural Infor- mation Processing Systems, 35:9125–9138, 2022

    Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre- training for open-world detection.Advances in Neural Infor- mation Processing Systems, 35:9125–9138, 2022. 2

  32. [32]

    Detclipv3: To- wards versatile generative open-vocabulary object detection

    Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, and Dan Xu. Detclipv3: To- wards versatile generative open-vocabulary object detection. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 27391–27401, 2024. 1

  33. [33]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605, 2022. 1

  34. [34]

    A simple framework for open-vocabulary segmentation and detection

    Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 1020–1031, 2023. 7

  35. [35]

    Just a few glances: Open-set visual perception with image prompt paradigm

    Jinrong Zhang, Penghui Wang, Chunxiao Liu, Wei Liu, Dian Jin, Qiong Zhang, Erli Meng, and Zhengnan Hu. Just a few glances: Open-set visual perception with image prompt paradigm. InProceedings of the AAAI Conference on Artifi- cial Intelligence, pages 9969–9976, 2025. 3, 7

  36. [36]

    An open and com- prehensive pipeline for unified object grounding and detec- tion.arXiv preprint arXiv:2401.02361, 2024

    Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xin- jiang Wang, Yining Li, and Haian Huang. An open and com- prehensive pipeline for unified object grounding and detec- tion.arXiv preprint arXiv:2401.02361, 2024. 5, 6, 7

  37. [37]

    Regionclip: Region- based language-image pretraining

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chun- yuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region- based language-image pretraining. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16793–16803, 2022. 2

  38. [38]

    Deformable DETR: Deformable Transformers for End-to-End Object Detection

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable trans- formers for end-to-end object detection.arXiv preprint arXiv:2010.04159, 2020. 1, 4

  39. [39]

    Generalized decoding for pixel, image, and lan- guage

    Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and lan- guage. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 15116–15127,

  40. [40]

    More Implementation Details A.1

    7 10 PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training Supplementary Material A. More Implementation Details A.1. Detailed Illustration of DMD Here, we further provide a detailed illustration of Dynamic Memory-Driven (DMD) Prompting, as shown in Figure 1. At iterationt, n categories are sampled from the Visual Cues Bank. For...