Pith · machine review for the scientific record

arxiv: 2604.02966 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 Lean theorem links

Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords UAV object detection · diffusion models · synthetic data generation · visual prototypes · focal regions · data augmentation · layout-to-image

The pith

UAVGen generates higher-fidelity synthetic images for UAV object detection by conditioning diffusion models on visual class prototypes and emphasizing focal regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses data scarcity in UAV-based object detection by proposing a layout-to-image synthesis method that produces labeled training images. It builds a diffusion model around representative visual prototypes for each object class, embedding them directly into the generation process to create more accurate tiny-object instances. A separate pipeline then concentrates synthesis effort on foreground focal regions while refining labels to fix missing, extra, or misaligned objects. The resulting images are shown to raise detection accuracy when added to training sets for multiple detector architectures.
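The pipeline described above can be sketched, under heavy simplification, as a loop that renders labeled synthetic samples from real layouts, refines their labels, and merges them into the training set. All names here (`generate_image`, `refine_labels`, `augment`) are illustrative stand-ins, not the authors' API:

```python
# Minimal sketch of the synthetic-augmentation loop; the diffusion model is
# replaced by a stand-in so the control flow is visible.
from dataclasses import dataclass, field

@dataclass
class Sample:
    image: str                                   # placeholder for pixel data
    boxes: list = field(default_factory=list)    # (class_id, x, y, w, h)

def generate_image(layout):
    # Stand-in for the diffusion model: renders a synthetic image
    # conditioned on a layout (a list of labeled boxes).
    return Sample(image=f"synthetic:{len(layout)} objects", boxes=list(layout))

def refine_labels(sample):
    # Stand-in for label refinement: drop degenerate boxes (zero area),
    # mimicking the "missing/extra/misaligned" correction step.
    sample.boxes = [b for b in sample.boxes if b[3] > 0 and b[4] > 0]
    return sample

def augment(real_samples, n_synthetic):
    synthetic = []
    for i in range(n_synthetic):
        layout = real_samples[i % len(real_samples)].boxes  # reuse real layouts
        synthetic.append(refine_labels(generate_image(layout)))
    return real_samples + synthetic

real = [Sample("real_0", [(0, 10, 10, 5, 5), (1, 2, 2, 0, 3)])]
train_set = augment(real, 2)
print(len(train_set))  # 3
```

The zero-width box in the real layout survives in the real sample but is filtered out of every synthetic one, which is the shape of the refinement step the pith describes.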

Core claim

UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. It pairs this with a Focal Region Enhanced Data Pipeline (FRE-DP) that emphasizes object-concentrated foreground regions in synthesis, combined with a label refinement step to correct missing, extra and misaligned generations.

What carries the argument

The Visual Prototype Conditioned Diffusion Model (VPC-DM), which embeds class-representative object instances into latent space for generation, together with the Focal Region Enhanced Data Pipeline (FRE-DP) for foreground focus and label correction.
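One natural reading of "class-representative instances integrated into latent embeddings" is a mean embedding per class used as an extra conditioning token alongside each layout box. This is an illustrative sketch under that assumption, not the paper's implementation:

```python
# Build one visual prototype per class as the mean of instance embeddings,
# then form one conditioning token per layout box: prototype + box coords.
import numpy as np

def class_prototypes(instance_embs, labels):
    # instance_embs: (N, D) object crops embedded by some encoder; labels: (N,)
    protos = {}
    for c in np.unique(labels):
        protos[int(c)] = instance_embs[labels == c].mean(axis=0)
    return protos

def condition_tokens(layout, protos):
    # layout: list of (class_id, (x, y, w, h)) with normalized coordinates.
    return np.stack([np.concatenate([protos[c], np.asarray(box)])
                     for c, box in layout])

rng = np.random.default_rng(0)
embs = rng.normal(size=(6, 4))
labels = np.array([0, 0, 1, 1, 1, 0])
protos = class_prototypes(embs, labels)
tokens = condition_tokens([(0, (0.1, 0.2, 0.05, 0.05)),
                           (1, (0.5, 0.5, 0.02, 0.03))], protos)
print(tokens.shape)  # (2, 8)
```

Each token carries both appearance (the prototype) and geometry (the box), which is the property that plain text-conditioned layout-to-image models lack for tiny objects.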

Load-bearing premise

The synthetic images produced by prototype conditioning and focal-region refinement have a distribution close enough to real UAV photos that they do not introduce biases or artifacts harmful to downstream detector training.
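Distribution closeness of this kind is commonly probed with a Fréchet-style distance between feature statistics of real and synthetic images (FID). A minimal sketch, under the simplifying and non-standard assumption of diagonal covariances so no matrix square root is needed:

```python
# Fréchet distance between Gaussian fits of two feature sets, assuming
# diagonal covariances: ||mu_r - mu_f||^2 + sum(s_r + s_f - 2*sqrt(s_r*s_f)).
import numpy as np

def fid_diag(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s_r, s_f = feats_real.var(axis=0), feats_fake.var(axis=0)
    return float(((mu_r - mu_f) ** 2).sum()
                 + (s_r + s_f - 2 * np.sqrt(s_r * s_f)).sum())

rng = np.random.default_rng(1)
a = rng.normal(0, 1, size=(500, 8))   # "real" features
b = rng.normal(0, 1, size=(500, 8))   # same distribution -> small distance
c = rng.normal(2, 1, size=(500, 8))   # shifted distribution -> large distance
print(fid_diag(a, b) < fid_diag(a, c))  # True
```

A low score on such a metric would support the premise, but it would not rule out detector-harmful artifacts that feature statistics fail to capture, which is why the downstream-training test below remains the decisive one.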

What would settle it

Measure whether detectors trained on real UAV data plus UAVGen images achieve the reported accuracy gains over real data alone when evaluated on a large held-out set of genuine UAV images.

Figures

Figures reproduced from arXiv: 2604.02966 by Jiaxin Chen, Wenhao Li, Yu Wu, Zehua Fu, Zimeng Wu.

Figure 1. Illustration of different layout-to-image data generation.
Figure 2. Architecture of Visual Prototype Conditioned Focal Region Generation. (a) The Visual Prototype Conditioned Diffusion Model (VPC-DM) generates images guided by layout images produced from selected visual prototypes. (b) The Focal Region Enhanced Data Pipeline (FRE-DP) synthesizes images on object-centric areas to avoid the limitations of small-object generation. Moreover, Label Refinement mitigates the misalig…
Figure 3. Comparison of mAP across different categories on Vis…
Figure 4. Comparison of generated images on VisDrone. Our method exhibits superior layout-image consistency and enhanced visual…
Figure 5. Impact of various scales of generated images on object…
read the original abstract

Unmanned aerial vehicle (UAV) based object detection is a critical but challenging task, when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in promoting detection accuracy by synthesizing labeled images based on diffusion models. However, they suffer from frequently producing artifacts, especially near layout boundaries of tiny objects, thus substantially limiting their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. Specifically, UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors. The source code is available at https://github.com/Sirius-Li/UAVGen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes UAVGen, a layout-to-image generation framework for UAV-based object detection under limited annotated data. It introduces the Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs class-representative instances and integrates them into latent embeddings for high-fidelity synthesis, along with the Focal Region Enhanced Data Pipeline (FRE-DP) that emphasizes foreground regions and applies label refinement to correct missing, extra, or misaligned objects. The central claim is that extensive experiments show significant outperformance over state-of-the-art methods and consistent accuracy gains when the generated data is used to train distinct detectors.

Significance. If the reported gains hold under rigorous validation, the work could meaningfully advance synthetic data augmentation for UAV object detection by reducing artifacts near tiny objects and improving distribution match to real aerial imagery. The open availability of code is a clear strength that supports reproducibility and enables direct testing of the pipeline's effect on downstream detectors.

minor comments (2)
  1. Abstract: The claim of 'significantly outperforms state-of-the-art approaches' would be more informative if accompanied by at least one concrete metric (e.g., mAP improvement on a named UAV dataset) rather than remaining purely qualitative.
  2. Method description: The integration of visual prototypes into latent embeddings (VPC-DM) and the precise mechanism of focal-region emphasis plus label refinement (FRE-DP) would benefit from an explicit statement of how these steps are combined in the overall training objective or inference schedule.
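The second point can be made concrete with one plausible inference schedule (an assumption for illustration, not a schedule stated in the paper): a full-scene VPC-DM pass followed by zoomed-in re-synthesis of focal regions, sketched here with a stand-in generator:

```python
# Hypothetical composition of VPC-DM and FRE-DP at inference. Boxes are
# (class_id, x, y, w, h); the area threshold and 4x zoom are illustrative.
def focal_regions(layout, small=0.01):
    # FRE-DP intuition: flag small objects for a second, zoomed-in pass.
    return [b for b in layout if b[3] * b[4] < small]

def generate(boxes, scale=1.0, base=None):
    # Stand-in for the diffusion model; records (num_boxes, scale) per pass.
    ops = [] if base is None else base
    ops.append((len(boxes), scale))
    return ops

def synthesize(layout):
    image = generate(layout)                            # full-scene pass
    for box in focal_regions(layout):                   # focal second passes
        image = generate([box], scale=4.0, base=image)
    return image

layout = [(0, 0.1, 0.1, 0.2, 0.2), (1, 0.5, 0.5, 0.05, 0.05)]  # 2nd is tiny
print(synthesize(layout))  # [(2, 1.0), (1, 4.0)]
```

Whether the two stages share a training objective or only compose at inference is exactly the detail the report asks the authors to state.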

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. We are pleased that the contributions of UAVGen, including VPC-DM and FRE-DP, are viewed as potentially advancing synthetic data augmentation for UAV object detection. No specific major comments were raised in the report, so we interpret the minor revision request as an opportunity to polish presentation and add any clarifying details where helpful.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes a new layout-to-image generation framework (UAVGen) consisting of VPC-DM for visual prototype conditioning in diffusion models and FRE-DP for focal region emphasis with label refinement. The central performance claims rest on empirical experiments showing outperformance over SOTA and gains when integrated with detectors. No derivation chain, equations, or fitted parameters are presented that reduce the outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to justify core components. The method is self-contained as a technical proposal with falsifiable code, yielding no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard diffusion model assumptions for image synthesis and the premise that prototype conditioning plus focal enhancement will reduce boundary artifacts for tiny objects; no explicit free parameters or new physical entities are stated in the abstract.

axioms (1)
  • domain assumption: Diffusion models can produce high-fidelity object instances when conditioned on representative visual prototypes.
    Core premise of the VPC-DM component described in the abstract.

pith-pipeline@v0.9.0 · 5500 in / 1113 out tokens · 50340 ms · 2026-05-13T20:53:20.348804+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear

    Relation between the paper passage and the cited Recognition theorem:

    UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear

    Relation between the paper passage and the cited Recognition theorem:

    Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 4 internal anchors

  1. [1]

    Multidiffusion: Fusing diffusion paths for controlled image generation

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. In Proceedings of the International Conference on Machine Learning, pages 1737–1752, 2023.

  2. [2]

    Large scale GAN training for high fidelity natural image synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Proceedings of the International Conference on Learning Representations, pages 9256–9291, 2018.

  3. [3]

    Geodiffusion: Text-prompted geometric control for object detection data generation

    Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, and Dit-Yan Yeung. Geodiffusion: Text-prompted geometric control for object detection data generation. In Proceedings of the International Conference on Learning Representations, pages 846–868, 2024.

  4. [4]

    Layoutdiffuse: Adapting foundational diffusion models for layout-to-image generation

    Jiaxin Cheng, Xiao Liang, Xingjian Shi, Tong He, Tianjun Xiao, and Mu Li. Layoutdiffuse: Adapting foundational diffusion models for layout-to-image generation. arXiv preprint arXiv:2302.08908, 2023.

  5. [5]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pages 4643–4651, 2025.

  6. [6]

    Quantifying the simulation–reality gap for deep learning-based drone detection

    Tamara Regina Dieter, Andreas Weinmann, Stefan Jäger, and Eva Brucherseifer. Quantifying the simulation–reality gap for deep learning-based drone detection. Electronics, 12(10):2197, 2023.

  7. [7]

    Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images

    Bowei Du, Yecheng Huang, Jiaxin Chen, and Di Huang. Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13435–13444, 2023.

  8. [8]

    The unmanned aerial vehicle benchmark: Object detection and tracking

    Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision, pages 370–386, 2018.

  9. [9]

    Visdrone-det2019: The vision meets drone object detection in image challenge results

    Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, Qinghua Hu, Tao Peng, Jiayu Zheng, Xinyao Wang, Yue Zhang, et al. Visdrone-det2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 213–226, 2019.

  10. [10]

    Modeling visual context is key to augmenting object detection datasets

    Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling visual context is key to augmenting object detection datasets. In Proceedings of the European Conference on Computer Vision, pages 364–380, 2018.

  11. [11]

    Cut, paste and learn: Surprisingly easy synthesis for instance detection

    Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1301–1310, 2017.

  12. [12]

    Help from the sky: Leveraging UAVs for disaster management

    Milan Erdelj, Enrico Natalizio, Kaushik R Chowdhury, and Ian F Akyildiz. Help from the sky: Leveraging UAVs for disaster management. IEEE Pervasive Computing, 16(1):24–32, 2017.

  13. [13]

    The Pascal visual object classes (VOC) challenge

    Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

  14. [14]

    Magicdrive: Street view generation with diverse 3d geometry control

    Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. In Proceedings of the International Conference on Learning Representations, pages 904–923, 2024.

  15. [15]

    Simple copy-paste is a strong data augmentation method for instance segmentation

    Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2918–2928, 2021.

  16. [16]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

  17. [17]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.

  18. [18]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851, 2020.

  19. [19]

    Processing and assessment of spectrometric, stereoscopic imagery collected using a lightweight UAV spectral camera for precision agriculture

    Eija Honkavaara, Heikki Saari, Jere Kaivosoja, Ilkka Pölönen, Teemu Hakala, Paula Litkey, Jussi Mäkynen, and Liisa Pesonen. Processing and assessment of spectrometric, stereoscopic imagery collected using a lightweight UAV spectral camera for precision agriculture. Remote Sensing, 5(10):5006–5039, 2013.

  20. [20]

    Decentralized autonomous navigation of a UAV network for road traffic monitoring

    Hailong Huang, Andrey V Savkin, and Chao Huang. Decentralized autonomous navigation of a UAV network for road traffic monitoring. IEEE Transactions on Aerospace and Electronic Systems, 57(4):2558–2564, 2021.

  21. [21]

    Ufpmp-det: Toward accurate and efficient object detection on drone imagery

    Yecheng Huang, Jiaxin Chen, and Di Huang. Ufpmp-det: Toward accurate and efficient object detection on drone imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1026–1033, 2022.

  22. [22]

    High-resolution complex scene synthesis with transformers

    Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 7054–7065, 2021.

  23. [23]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.

  24. [24]

    YOLOv11: An overview of the key architectural enhancements

    Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.

  25. [25]

    Skyscenes: A synthetic dataset for aerial scene understanding

    Sahil Khose, Anisha Pal, Aayushi Agarwal, Deepanshi, Judy Hoffman, and Prithvijit Chattopadhyay. Skyscenes: A synthetic dataset for aerial scene understanding. In Proceedings of the European Conference on Computer Vision, pages 19–35, 2024.

  26. [26]

    Variational diffusion models

    Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems, pages 21696–21707, 2021.

  27. [27]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

  28. [28]

    Remdet: Rethinking efficient model design for UAV object detection

    Chen Li, Rui Zhao, Zeyu Wang, Huiying Xu, and Xinzhong Zhu. Remdet: Rethinking efficient model design for UAV object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4643–4651, 2025.

  29. [29]

    Trackdiffusion: Tracklet-conditioned video generation via diffusion models

    Pengxiang Li, Kai Chen, Zhili Liu, Ruiyuan Gao, Lanqing Hong, Dit-Yan Yeung, Huchuan Lu, and Xu Jia. Trackdiffusion: Tracklet-conditioned video generation via diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3539–3548.

  30. [30]

    Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection

    Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Advances in Neural Information Processing Systems, pages 21002–21012, 2020.

  31. [31]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.

  32. [32]

    Image synthesis from layout with locality-aware mask adaption

    Zejian Li, Jingyu Wu, Immanuel Koh, Yongchuan Tang, and Lingyun Sun. Image synthesis from layout with locality-aware mask adaption. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13819–13828, 2021.

  33. [33]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pages 740–755, 2014.

  34. [34]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, pages 8162–8171, 2021.

  35. [35]

    YOLOv3: An incremental improvement

    Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

  36. [36]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

  37. [37]

    Faster R-CNN: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.

  38. [38]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, pages 1278–1286.

  39. [39]

    Syndrone: multi-modal UAV dataset for urban scenarios

    Giulia Rizzoli, Francesco Barbato, Matteo Caligiuri, and Pietro Zanuttigh. Syndrone: multi-modal UAV dataset for urban scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2210–2220.

  40. [40]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  41. [41]

    Stereo vision three-dimensional terrain maps for precision agriculture

    Francisco Rovira-Más, Qin Zhang, and John F Reid. Stereo vision three-dimensional terrain maps for precision agriculture. Computers and Electronics in Agriculture, 60(2):133–143, 2008.

  42. [42]

    Progressive transformation learning for leveraging virtual images in training

    Yi-Ting Shen, Hyungtae Lee, Heesung Kwon, and Shuvra S Bhattacharyya. Progressive transformation learning for leveraging virtual images in training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 835–844, 2023.

  43. [43]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, pages 2256–2265, 2015.

  44. [44]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations, pages 14205–14224, 2021.

  45. [45]

    Improved techniques for training score-based generative models

    Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems, pages 12438–12448.

  46. [46]

    Aerogen: Enhancing remote sensing object detection with diffusion-driven data generation

    Datao Tang, Xiangyong Cao, Xuan Wu, Jialin Li, Jing Yao, Xueru Bai, Dongsheng Jiang, Yin Li, and Deyu Meng. Aerogen: Enhancing remote sensing object detection with diffusion-driven data generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3614–3624, 2025.

  47. [47]

    YOLOv12: Attention-centric real-time object detectors

    Yunjie Tian, Qixiang Ye, and David Doermann. YOLOv12: Attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524, 2025.

  48. [48]

    Satsynth: Augmenting image-mask pairs through diffusion models for aerial semantic segmentation

    Aysim Toker, Marvin Eisenberger, Daniel Cremers, and Laura Leal-Taixé. Satsynth: Augmenting image-mask pairs through diffusion models for aerial semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27695–27705, 2024.

  49. [49]

    NVAE: A deep hierarchical variational autoencoder

    Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In Advances in Neural Information Processing Systems, pages 19667–19679, 2020.

  50. [50]

    Instancediffusion: Instance-level control for image generation

    Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, and Ishan Misra. Instancediffusion: Instance-level control for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6232–6242, 2024.

  51. [51]

    Detdiffusion: Synergizing generative and perceptive models for enhanced data generation and perception

    Yibo Wang, Ruiyuan Gao, Kai Chen, Kaiqiang Zhou, Yingjie Cai, Lanqing Hong, Zhenguo Li, Lihui Jiang, Dit-Yan Yeung, Qiang Xu, et al. Detdiffusion: Synergizing generative and perceptive models for enhanced data generation and perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7246–7255, 2024.

  52. [52]

    Domain adaptive object detection for UAV-based images by robust representation learning and multiple pseudo-label aggregation

    Ke Wu, Jiaxin Chen, and Miao Wang. Domain adaptive object detection for UAV-based images by robust representation learning and multiple pseudo-label aggregation. In Proceedings of the ACM MM Workshops on Efficient Multimedia Computing under Limited, pages 59–67, 2024.

  53. [53]

    Datasetdm: Synthesizing data with perception annotations using diffusion models

    Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, and Chunhua Shen. Datasetdm: Synthesizing data with perception annotations using diffusion models. In Advances in Neural Information Processing Systems, pages 54683–54695, 2023.

  54. [54]

    Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion

    Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7452–7461, 2023.

  55. [55]

    Reco: Region-controlled text-to-image generation

    Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, et al. Reco: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14246–14255.

  56. [56]

    Synplay: Importing real-world diversity for a synthetic human dataset

    Jinsub Yim, Hyungtae Lee, Sungmin Eum, Yi-Ting Shen, Yan Zhang, Heesung Kwon, and Shuvra S Bhattacharyya. Synplay: Importing real-world diversity for a synthetic human dataset. arXiv e-prints, pages arXiv–2408, 2024.

  57. [57]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

  58. [58]

    Datasetgan: Efficient labeled data factory with minimal human effort

    Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio Torralba, and Sanja Fidler. Datasetgan: Efficient labeled data factory with minimal human effort. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10145–10155, 2021.

  59. [59]

    X-paste: Revisiting scalable copy-paste for instance segmentation using CLIP and StableDiffusion

    Hanqing Zhao, Dianmo Sheng, Jianmin Bao, Dongdong Chen, Dong Chen, Fang Wen, Lu Yuan, Ce Liu, Wenbo Zhou, Qi Chu, et al. X-paste: Revisiting scalable copy-paste for instance segmentation using CLIP and StableDiffusion. In Proceedings of the International Conference on Machine Learning, pages 42098–42109, 2023.

  60. [60]

    Layoutdiffusion: Controllable diffusion model for layout-to-image generation

    Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22490–22499, 2023.

  61. [61]

    Migc: Multi-instance generation controller for text-to-image synthesis

    Dewei Zhou, You Li, Fan Ma, Xiaoting Zhang, and Yi Yang. Migc: Multi-instance generation controller for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6818–6828, 2024.

  62. [62]

    Odgen: Domain-specific object detection data generation with diffusion models

    Jingyuan Zhu, Shiyu Li, Yuxuan Andy Liu, Jian Yuan, Ping Huang, Jiulong Shan, and Huimin Ma. Odgen: Domain-specific object detection data generation with diffusion models. In Advances in Neural Information Processing Systems, pages 63599–63633, 2024.

  63. [63]

    All experiments were conducted on 8 NVIDIA RTX 3080Ti GPUs

    and UAVDT [8]. All experiments were conducted on 8 NVIDIA RTX 3080Ti GPUs. For the UAV-based object detection model, following the default experimental setup of Remdet [28], we trained the detector on the VisDrone dataset for 300 epochs with a learning rate of 0.01, and applied data augmentation techniques such as mixup and Mosaic. The input image size...