Training-Free Metrics for Synthetic Object Detection Data: A Proxy for Detector Performance

Donghoon Yeo; Myeongseok Nam; Seungwook Kim

arxiv: 2606.19817 · v1 · pith:5QH3HG44new · submitted 2026-06-18 · 💻 cs.CV

Myeongseok Nam, Donghoon Yeo, Seungwook Kim

Pith reviewed 2026-06-26 18:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords: synthetic data, object detection, performance proxy, domain match, training-free metrics, CCDM, VisDrone-DET, YOLOv8

The pith

CCDM metrics achieve a Spearman correlation of 1.0 with YOLOv8 performance as a training-free proxy for synthetic object detection data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Conditional-Composition Domain Match (CCDM) metrics, which rank how much different synthetic datasets will improve object detector training without running any training experiments. Judging a synthetic set by training a downstream detector is slow and computationally expensive, and the paper notes the problem is more pronounced for detection, where bounding-box annotations are dense. A cheap, pre-computable score would let researchers compare many generative pipelines quickly. On the VisDrone-DET dataset the new metrics produce a perfect rank correlation with the accuracy YOLOv8 reaches after training, ahead of earlier synthetic-image scores. The method measures how closely the synthetic images match the composition and domain statistics of real data, conditioned on how objects are arranged in each image.

Core claim

The CCDM metric families achieve a Spearman correlation of 1.0 with the downstream performance of YOLOv8 on the VisDrone-DET dataset, serving as a pre-computable proxy for the relative utility of candidate synthetic training sets for object detection.
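
To make the headline number concrete, here is a minimal sketch of how a Spearman rank correlation between a training-free score and detector mAP could be computed. The scores below are hypothetical placeholders, not values from the paper, and the sign convention (negating a distance so that +1.0 means "smaller distance goes with higher mAP") is an assumption made for this illustration; the paper's own "signed" convention is not spelled out in the text available here.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def proxy_agreement(distance_scores, map_scores):
    """Assumed convention: negate distances so +1.0 means a perfect proxy."""
    return spearman([-d for d in distance_scores], map_scores)


# Hypothetical numbers for illustration only; NOT taken from the paper.
distance = [0.10, 0.25, 0.40, 0.55, 0.70]
map_5095 = [0.30, 0.27, 0.22, 0.18, 0.11]
rho = proxy_agreement(distance, map_5095)
print(rho)
```

Because Spearman's ρ depends only on rank order, any set of scores that orders the candidates the same way as mAP gives 1.0, however far apart the raw values are.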

What carries the argument

The Conditional-Composition Domain Match (CCDM) metric family, which scores synthetic images by how well their object compositions and domains align with real data to predict downstream detector utility.
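
The Figure 1 caption reproduced further down describes the general recipe: stratify images by per-image metadata such as object count, compare features within each stratum, and measure the mismatch between the real and synthetic metadata compositions. The sketch below illustrates that idea only. It is not the paper's definition: the caption is truncated before it names the composition-mismatch measure, so total variation distance is used here purely as a stand-in, and the bucket thresholds, the RBF-kernel MMD, and the weighting by the real composition are all assumptions.

```python
import math
from collections import defaultdict

STRATA = ("solo", "few", "crowded")  # strata names taken from the caption


def bucket(object_count):
    """Map an object count to a stratum. Thresholds are hypothetical."""
    if object_count <= 1:
        return "solo"
    if object_count <= 10:
        return "few"
    return "crowded"


def rbf(a, b, gamma=1.0):
    """RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def mmd2(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel."""
    def mean_kernel(ps, qs):
        return sum(rbf(p, q, gamma) for p in ps for q in qs) / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


def stratify(samples):
    """Group feature vectors by stratum; samples are (features, object_count)."""
    groups = defaultdict(list)
    for features, object_count in samples:
        groups[bucket(object_count)].append(features)
    return groups


def composition(groups):
    """Fraction of images per stratum (assumes at least one image overall)."""
    total = sum(len(groups.get(s, [])) for s in STRATA)
    return {s: len(groups.get(s, [])) / total for s in STRATA}


def stratified_score(real, synth, gamma=1.0):
    """Return (within-stratum feature mismatch, composition mismatch)."""
    g_r, g_s = stratify(real), stratify(synth)
    p_r, p_s = composition(g_r), composition(g_s)
    within = 0.0
    for s in STRATA:
        # Strata missing on either side are skipped here; the gap they
        # leave shows up in the composition term instead.
        if g_r.get(s) and g_s.get(s):
            within += p_r[s] * mmd2(g_r[s], g_s[s], gamma)
    # Total variation distance: an assumed stand-in, not the paper's choice.
    tv = 0.5 * sum(abs(p_r[s] - p_s[s]) for s in STRATA)
    return within, tv


# Toy 2-D "features" with object counts; hypothetical data for illustration.
real = [([0.0, 0.0], 1), ([0.1, 0.0], 5), ([0.0, 0.1], 20), ([0.2, 0.1], 3)]
synth = [([0.0, 0.1], 1), ([0.1, 0.1], 1), ([0.3, 0.0], 1), ([0.2, 0.2], 1)]
within, tv = stratified_score(real, synth)
print(within, tv)
```

In this toy case every synthetic image falls in the "solo" stratum, so the composition term is large even though the solo features sit close to the real ones. A single global distance over all features would not separate those two effects.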

If this is right

Synthetic training sets for object detection can be ranked and selected before any detector is trained.
The CCDM scores outperform prior metrics in how closely they track actual detector accuracy after training.
Evaluation of generative models for detection data becomes feasible at the scale of many candidate datasets.
The need for dense bounding-box annotation during metric computation is avoided entirely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the correlation pattern persists, researchers could use CCDM scores to guide iterative improvement of generative models aimed at detection tasks.
The same conditional-composition idea might extend to other dense prediction problems such as instance segmentation.
The metric could be tested on synthetic data from entirely different generators to check whether its rankings remain reliable, given that its definition is meant to be independent of any particular downstream model.

Load-bearing premise

The perfect correlation observed with YOLOv8 on VisDrone-DET will generalize to other detectors, datasets, and synthetic generation methods.

What would settle it

Applying the same CCDM evaluation to a different detector, such as Faster R-CNN, on a different dataset and split, and observing a Spearman correlation below 1.0.

Figures

Figures reproduced from arXiv: 2606.19817 by Donghoon Yeo, Myeongseok Nam, Seungwook Kim.

**Figure 1.** Comparison of domain match metrics. (a) FID fits a single Gaussian to each set and compares global mean and covariance. (b) MMD compares the two distributions globally via pairwise kernel similarities. (c) Our CCDM stratifies images by per-image metadata (e.g., object count: solo, few, crowded), aligns features within each stratum, and measures the mismatch between the metadata compositions p_r and p_s via…

**Figure 2.** Generations from the four synthetic pools used in Section…

**Figure 3.** YOLOv8m test-dev mAP@0.5:0.95 against two training-free metrics for the five candidate training sets of Table 3. (a) FID is appearance-biased: the two ω = 1.0 synthetic pools score lower (closer) than the real training set, yet the real set yields the highest detector mAP by a wide margin. Signed Spearman ρ = +0.200. (b) CCDM-MMDCLIP orders all five candidates in exact agreement with mAP, achieving sign…

**Figure 4.** Qualitative YOLOv8m predictions on five VisDrone-DET test-dev frames (boxes colored by predicted class). Each row is a…

Original abstract

With the recent advent of image generative models, synthetic data are increasingly being used to supplement limited real datasets for training computer vision models. However, not all synthetic datasets improve performance equally, and their effectiveness can only be assessed by training a downstream model, which is computationally expensive and time-consuming. This problem is pronounced in the task of object detection, where the required annotations are much more dense due to bounding boxes. In this paper, we propose a pre-computable metric family, dubbed Conditional-Composition Domain Match (CCDM), which serves as a proxy for the relative utility of candidate synthetic training sets for downstream detection. Experiments on the VisDrone-DET dataset show that the CCDM metric families achieve a Spearman correlation of 1.0 with the downstream performance of YOLOv8, clearly outperforming existing metrics for synthetic image evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CCDM gets a perfect 1.0 Spearman correlation with YOLOv8 on VisDrone-DET synthetic sets, but the test is limited to that one detector and dataset.

The paper introduces CCDM, a metric family meant to rank synthetic object-detection datasets by how much they will help a downstream detector, without actually training one. The central result is that CCDM reaches a Spearman rho of 1.0 with YOLOv8 mAP on VisDrone-DET, ahead of the existing synthetic-image evaluation metrics it is compared against.

The motivation is practical. Synthetic data from generative models is cheap to produce but expensive to validate, and detection needs dense labels. A pre-computable proxy that avoids full training runs would be useful in curation loops.

The narrow evidence is the main limitation. The reported correlation holds only for YOLOv8 and only on VisDrone-DET. No results appear for two-stage detectors, transformers, or other datasets with different class balance or imaging conditions. If CCDM's conditional composition terms happen to align with YOLOv8's particular biases, the perfect score could be an artifact rather than a general property. The generalization concern stands: without cross-detector or cross-dataset checks, the proxy claim stays unproven.

Computation details are also thin in the available text. We do not see the exact formula, the number of synthetic sets tested, or any error bars, so it is hard to judge whether 1.0 reflects robust ranking or a small-sample coincidence. No sign of circularity in the abstract, but that needs checking in the methods.
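
The small-sample worry can be quantified. The Figure 3 caption reproduced above refers to five candidate training sets; if the headline correlation is computed over those five (an assumption, since the abstract does not state the sample size), the sketch below enumerates every possible ranking to get an exact permutation p-value. It assumes no tied scores and a null hypothesis in which all rank orderings are equally likely.

```python
from fractions import Fraction
from itertools import permutations


def spearman_no_ties(x_ranks, y_ranks):
    """Exact Spearman rho for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x_ranks)
    d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    denom = n * (n * n - 1)
    return Fraction(denom - 6 * d2, denom)


def exact_p_value(n, observed_rho):
    """One-sided p-value: share of all n! rankings with rho >= observed_rho."""
    base = tuple(range(1, n + 1))
    hits = 0
    total = 0
    for perm in permutations(base):
        total += 1
        if spearman_no_ties(base, perm) >= observed_rho:
            hits += 1
    return Fraction(hits, total)


# n = 5 is an assumption taken from the Figure 3 caption, not from the abstract.
p_one_sided = exact_p_value(5, Fraction(1))
print(p_one_sided, float(p_one_sided))
```

With five items there are 5! = 120 orderings and only one matches perfectly, so a rho of 1.0 has a one-sided chance probability of 1/120, roughly 0.008. That is unlikely under the null, but it is also the strongest result five points can produce, which is why the sample size matters when judging the claim.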

The work is aimed at researchers who generate or filter synthetic detection data. Someone already running those pipelines would find the idea worth testing, even if they end up modifying the metric.

I would send it to peer review. The problem is real and the reported number is striking, but referees will need to see broader validation before the proxy claim can be taken as general.

Referee Report

3 major / 1 minor

Summary. The paper proposes a family of pre-computable metrics called Conditional-Composition Domain Match (CCDM) to rank the utility of synthetic datasets for object detection training without running downstream training. It claims that CCDM variants achieve a Spearman correlation of exactly 1.0 with YOLOv8 mAP on the VisDrone-DET dataset and outperform prior synthetic-image metrics.

Significance. A reliable training-free proxy for synthetic data utility would reduce the cost of dataset selection in object detection. The reported perfect correlation, if shown to be robust and non-circular, would constitute a useful practical contribution.

major comments (3)

[Abstract] The reported Spearman correlation of exactly 1.0 is given without the number of synthetic sets tested, without error bars or p-values, and without any description of how CCDM is computed; this prevents verification that the result is robust rather than an artifact of small-sample selection or metric definition.
[Experiments] All reported results are restricted to a single detector (YOLOv8) and a single dataset (VisDrone-DET); no cross-detector tests (e.g., two-stage or transformer detectors) or cross-dataset tests are provided, so the proxy claim rests on an untested assumption that the observed ranking generalizes beyond YOLOv8's particular inductive biases.
[Method] The explicit definition and equations for the conditional composition terms in CCDM are not supplied, making it impossible to confirm that the metric does not embed information derived from downstream detector outputs and thereby reduce to a fitted quantity by construction.

minor comments (1)

[Abstract] The phrase 'CCDM metric families' is used without indicating how many distinct variants are evaluated or how they differ.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: [Abstract] The reported Spearman correlation of exactly 1.0 is given without the number of synthetic sets tested, without error bars or p-values, and without any description of how CCDM is computed; this prevents verification that the result is robust rather than an artifact of small-sample selection or metric definition.

Authors: We agree that the abstract requires additional context for proper assessment of the result. The revised abstract will specify the number of synthetic sets used, report the associated p-value, and include a concise description of how CCDM is computed from synthetic data statistics. revision: yes

Referee: [Experiments] All reported results are restricted to a single detector (YOLOv8) and a single dataset (VisDrone-DET); no cross-detector tests (e.g., two-stage or transformer detectors) or cross-dataset tests are provided, so the proxy claim rests on an untested assumption that the observed ranking generalizes beyond YOLOv8's particular inductive biases.

Authors: The reported experiments are indeed confined to YOLOv8 on VisDrone-DET. This scope was selected to evaluate the metric on a challenging, high-variance detection scenario. CCDM is formulated without reference to any detector's inductive biases, relying solely on conditional composition matching between synthetic and real domains. We will revise the experiments section to explicitly acknowledge this limitation and discuss the metric's detector-agnostic design, but we do not plan to incorporate new cross-detector experiments in the current revision. revision: partial

Referee: [Method] The explicit definition and equations for the conditional composition terms in CCDM are not supplied, making it impossible to confirm that the metric does not embed information derived from downstream detector outputs and thereby reduce to a fitted quantity by construction.

Authors: The Method section supplies the definitions and equations for the conditional composition terms (Equations 2-5), which operate exclusively on annotations and statistics derived from the synthetic images themselves. No downstream detector outputs or fitted parameters from the target task are involved, preserving the training-free property. We will revise the section to restate the equations more prominently and add an explicit paragraph confirming that computation uses only synthetic data properties. revision: yes

Circularity Check

0 steps flagged

No circularity: metric defined independently of detector performance

The paper defines CCDM as a training-free, pre-computable metric family based on conditional composition domain matching for synthetic object detection data. The reported Spearman correlation of 1.0 with YOLOv8 mAP on VisDrone-DET is presented as an empirical observation from experiments, not as a definitional or fitted equivalence. No equations or descriptions indicate that CCDM terms are constructed from or tuned to downstream detector outputs; the metric is claimed to be computable without training any detector. This makes the derivation self-contained against external benchmarks, with the correlation serving as validation rather than a reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no internal definition of CCDM is supplied, so no free parameters, axioms, or invented entities can be extracted from the text.

Reference graph

Works this paper leans on

56 extracted references · 9 linked inside Pith

[1]

Demystifying mmd gans.arXiv preprint arXiv:1801.01401, 2018

Mikołaj Bi´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans.arXiv preprint arXiv:1801.01401, 2018. 1, 2, 3

Pith/arXiv arXiv 2018
[2]

FLUX.1.https://github.com/ black-forest-labs/flux, 2024

Black Forest Labs. FLUX.1.https://github.com/ black-forest-labs/flux, 2024. 5

2024
[3]

Yolov4: Optimal speed and accuracy of object detection.arXiv preprint arXiv:2004.10934, 2020

Alexey Bochkovskiy, Chien-Yao Wang, and Hong- Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection.arXiv preprint arXiv:2004.10934, 2020. 2

Pith/arXiv arXiv 2004
[4]

Pros and cons of gan evaluation measures.Com- puter vision and image understanding, 179:41–65, 2019

Ali Borji. Pros and cons of gan evaluation measures.Com- puter vision and image understanding, 179:41–65, 2019. 2

2019
[5]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 1, 2

2020
[6]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 1

2009
[7]

Carla: An open urban driv- ing simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Anto- nio Lopez, and Vladlen Koltun. Carla: An open urban driv- ing simulator. InConference on robot learning, pages 1–16. PMLR, 2017. 2

2017
[8]

The unmanned aerial vehicle benchmark: Object detection and tracking

Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. InProceedings of the European con- ference on computer vision (ECCV), pages 370–386, 2018. 3

2018
[9]

Visdrone-det2019: The vision meets drone ob- ject detection in image challenge results

Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, Qinghua Hu, Tao Peng, Jiayu Zheng, Xinyao Wang, Yue Zhang, et al. Visdrone-det2019: The vision meets drone ob- ject detection in image challenge results. InProceedings of the IEEE/CVF international conference on computer vision workshops, pages 0–0, 2019. 1, 2, 3, 5

2019
[10]

Instagen: Enhancing object detection by training on syn- thetic dataset

Chengjian Feng, Yujie Zhong, Zequn Jie, Weidi Xie, and Lin Ma. Instagen: Enhancing object detection by training on syn- thetic dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14121– 14130, 2024. 1, 2

2024
[11]

Virtual worlds as proxy for multi-object tracking anal- ysis

Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking anal- ysis. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4340–4349, 2016. 2

2016
[12]

Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36:52132–52152, 2023. 2

2023
[13]

Fast r-cnn

Ross Girshick. Fast r-cnn. InProceedings of the IEEE inter- national conference on computer vision, pages 1440–1448,
[14]

Rich feature hierarchies for accurate object detection and semantic segmentation

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 580–587, 2014. 1, 2

2014
[15]

Borgwardt, Malte J

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch ¨olkopf, and Alexander Smola. A kernel two- sample test.Journal of Machine Learning Research, 13(25): 723–773, 2012. 2

2012
[16]

A kernel two-sample test.The journal of machine learning research, 13(1):723– 773, 2012

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bern- hard Sch¨olkopf, and Alexander Smola. A kernel two-sample test.The journal of machine learning research, 13(1):723– 773, 2012. 1, 3, 5

2012
[17]

Synthetic data for text localisation in natural images

Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. InPro- ceedings of the IEEE conference on computer vision and pat- tern recognition, pages 2315–2324, 2016. 1

2016
[18]

Lvis: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019. 1, 6

2019
[19]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Doll ´ar, and Ross Gir- shick. Mask r-cnn. InProceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 1

2017
[20]

Clipscore: A reference-free evaluation met- ric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. InProceedings of the 2021 confer- ence on empirical methods in natural language processing, pages 7514–7528, 2021. 2

2021
[21]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017. 1, 2, 3

2017
[22]

Cycada: Cycle-consistent adversarial domain adaptation

Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning, pages 1989–

1989
[23]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022. 4, 5

2022
[24]

Learning to segment every thing

Ronghang Hu, Piotr Doll ´ar, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to segment every thing. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 4233–4241, 2018. 1

2018
[25]

Re- thinking fid: Towards a better evaluation metric for image generation

Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Re- thinking fid: Towards a better evaluation metric for image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9307–9315,
[26]

Ultralytics yolov8, 2023

Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics yolov8, 2023. 5

2023
[27]

Driving in the matrix: Can virtual worlds replace human- generated annotations for real world tasks?arXiv preprint arXiv:1610.01983, 2016

Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the matrix: Can virtual worlds replace human- generated annotations for real world tasks?arXiv preprint arXiv:1610.01983, 2016. 2 7

Pith/arXiv arXiv 2016
[28]

Few-shot object detection via feature reweighting

Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. InProceedings of the IEEE/CVF international conference on computer vision, pages 8420–8429, 2019. 1

2019
[29]

Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

Pith/arXiv arXiv 2001
[30]

Pick-a-pic: An open dataset of user preferences for text-to-image generation.Ad- vances in neural information processing systems, 36:36652– 36663, 2023

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Ad- vances in neural information processing systems, 36:36652– 36663, 2023. 2

2023
[31]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 1, 2, 6

2014
[32]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 2

2017
[33]

Ssd: Single shot multibox detector

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. InEuropean con- ference on computer vision, pages 21–37. Springer, 2016. 2

2016
[34]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 11976–11986,
[35]

What makes good synthetic training data for learning dis- parity and optical flow estimation?International Journal of Computer Vision, 126(9):942–960, 2018

Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazir- bas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. What makes good synthetic training data for learning dis- parity and optical flow estimation?International Journal of Computer Vision, 126(9):942–960, 2018. 2

2018
[36]

Conditional detr for fast training convergence

Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional detr for fast training convergence. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 3651–3660, 2021. 2

2021
[37]

How useful is self- supervised pretraining for visual tasks? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7345–7354, 2020

Alejandro Newell and Jia Deng. How useful is self- supervised pretraining for visual tasks? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7345–7354, 2020. 2

2020
[38]

Dinov2: Learning robust visual features without supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 3

Pith/arXiv arXiv 2023
[39]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3

2021
[40]

Yolo9000: better, faster, stronger

Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017. 1, 2

2017
[41]

Yolov3: An incremental improvement.arXiv preprint arXiv:1804.02767, 2018

Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement.arXiv preprint arXiv:1804.02767, 2018

Pith/arXiv arXiv 2018
[42]

You only look once: Unified, real-time object de- tection

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object de- tection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016. 1, 2

2016
[43]

Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in neural information process- ing systems, 28, 2015

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in neural information process- ing systems, 28, 2015. 1, 2

2015
[44]

Playing for data: Ground truth from computer games

Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. InEuropean conference on computer vision, pages 102–118. Springer, 2016. 2

2016
[45]

Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016. 1, 2

2016
[46]

Learning from synthetic data: Addressing domain shift for semantic segmentation

Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3752–3761, 2018. 1, 2

2018
[47]

From gaming to research: Gta v for synthetic data generation for robotics and navigations

Matteo Scucchia, Paula Arranz, Matteo Ferrara, and Davide Maltoni. From gaming to research: Gta v for synthetic data generation for robotics and navigations. In2025 7th In- ternational Conference on Robotics and Computer Vision (ICRCV), pages 187–196. IEEE, 2025. 2

2025
[48]

Revisiting unreasonable effectiveness of data in deep learning era

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhi- nav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. InProceedings of the IEEE international conference on computer vision, pages 843–852, 2017. 1, 2

2017
[49]

Im- proving the effectiveness of deep generative data

Ruyu Wang, Sabrina Schmedding, and Marco F Huber. Im- proving the effectiveness of deep generative data. InPro- ceedings of the IEEE/CVF Winter Conference on Applica- tions of Computer Vision, pages 4922–4932, 2024. 2

2024
[50]

Frustratingly simple few-shot object detection.arXiv preprint arXiv:2003.06957, 2020

Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gon- zalez, and Fisher Yu. Frustratingly simple few-shot object detection.arXiv preprint arXiv:2003.06957, 2020. 1

arXiv 2003
[51]

Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,

Pith/arXiv arXiv
[52]

Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023. 2

2023
[53]

Dino: Detr with improved denoising anchor boxes for end-to-end object de- tection.arXiv preprint arXiv:2203.03605, 2022

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object de- tection.arXiv preprint arXiv:2203.03605, 2022. 2 8

Pith/arXiv arXiv 2022
[54]

De- trs beat yolos on real-time object detection

Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. De- trs beat yolos on real-time object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16965–16974, 2024. 1, 2

2024
[55]

Deformable detr: Deformable trans- formers for end-to-end object detection.arXiv preprint arXiv:2010.04159, 2020

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable trans- formers for end-to-end object detection.arXiv preprint arXiv:2010.04159, 2020. 2

Pith/arXiv arXiv 2010
[56]

Object detection in 20 years: A survey.Proceed- ings of the IEEE, 111(3):257–276, 2023

Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey.Proceed- ings of the IEEE, 111(3):257–276, 2023. 2 9

2023

[1] [1]

Demystifying mmd gans.arXiv preprint arXiv:1801.01401, 2018

Mikołaj Bi´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans.arXiv preprint arXiv:1801.01401, 2018. 1, 2, 3

Pith/arXiv arXiv 2018

[2] [2]

FLUX.1.https://github.com/ black-forest-labs/flux, 2024

Black Forest Labs. FLUX.1.https://github.com/ black-forest-labs/flux, 2024. 5

2024

[3] [3]

Yolov4: Optimal speed and accuracy of object detection.arXiv preprint arXiv:2004.10934, 2020

Alexey Bochkovskiy, Chien-Yao Wang, and Hong- Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection.arXiv preprint arXiv:2004.10934, 2020. 2

Pith/arXiv arXiv 2004

[4] [4]

Pros and cons of gan evaluation measures.Com- puter vision and image understanding, 179:41–65, 2019

Ali Borji. Pros and cons of gan evaluation measures.Com- puter vision and image understanding, 179:41–65, 2019. 2

2019

[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.

[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

[7] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017.

[8] Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European conference on computer vision (ECCV), pages 370–386, 2018.

[9] Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, Qinghua Hu, Tao Peng, Jiayu Zheng, Xinyao Wang, Yue Zhang, et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF international conference on computer vision workshops, pages 0–0, 2019.

[10] Chengjian Feng, Yujie Zhong, Zequn Jie, Weidi Xie, and Lin Ma. InstaGen: Enhancing object detection by training on synthetic dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14121–14130, 2024.

[11] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4340–4349, 2016.

[12] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.

[13] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448,

[14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

[15] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.

[16] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The journal of machine learning research, 13(1):723–773, 2012.

[17] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2315–2324, 2016.

[18] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.

[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.

[20] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021.

[21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017.

[22] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In International conference on machine learning, pages 1989–

[23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

[24] Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to segment every thing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4233–4241, 2018.

[25] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9307–9315,

[26] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLOv8, 2023.

[27] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? arXiv preprint arXiv:1610.01983, 2016.

[28] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8420–8429, 2019.

[29] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

[30] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems, 36:36652–36663, 2023.

[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.

[32] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.

[33] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.

[34] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986,

[35] Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. What makes good synthetic training data for learning disparity and optical flow estimation? International Journal of Computer Vision, 126(9):942–960, 2018.

[36] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3651–3660, 2021.

[37] Alejandro Newell and Jia Deng. How useful is self-supervised pretraining for visual tasks? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7345–7354, 2020.

[38] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[40] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.

[41] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[42] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.

[43] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.

[44] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In European conference on computer vision, pages 102–118. Springer, 2016.

[45] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in neural information processing systems, 29, 2016.

[46] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3752–3761, 2018.

[47] Matteo Scucchia, Paula Arranz, Matteo Ferrara, and Davide Maltoni. From gaming to research: GTA V for synthetic data generation for robotics and navigations. In 2025 7th International Conference on Robotics and Computer Vision (ICRCV), pages 187–196. IEEE, 2025.

[48] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017.

[49] Ruyu Wang, Sabrina Schmedding, and Marco F Huber. Improving the effectiveness of deep generative data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4922–4932, 2024.

[50] Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957, 2020.

[51] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,

[52] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.

[53] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.
