Pith · machine review for the scientific record

arxiv: 2604.02966 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 Lean theorem links

Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords UAV object detection · diffusion models · synthetic data generation · visual prototypes · focal regions · data augmentation · layout-to-image

The pith

UAVGen generates higher-fidelity synthetic images for UAV object detection by conditioning diffusion models on visual class prototypes and emphasizing focal regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses data scarcity in UAV-based object detection by proposing a layout-to-image synthesis method that produces labeled training images. It builds a diffusion model around representative visual prototypes for each object class, embedding them directly into the generation process to create more accurate tiny-object instances. A separate pipeline then concentrates synthesis effort on foreground focal regions while refining labels to fix missing, extra, or misaligned objects. The resulting images are shown to raise detection accuracy when added to training sets for multiple detector architectures.
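The pipeline described above can be sketched, under heavy simplification, as a loop that renders labeled synthetic samples from real layouts, refines their labels, and merges them into the training set. All names here (`generate_image`, `refine_labels`, `augment`) are illustrative stand-ins, not the authors' API:

```python
# Minimal sketch of the synthetic-augmentation loop; the diffusion model is
# replaced by a stand-in so the control flow is visible.
from dataclasses import dataclass, field

@dataclass
class Sample:
    image: str                                   # placeholder for pixel data
    boxes: list = field(default_factory=list)    # (class_id, x, y, w, h)

def generate_image(layout):
    # Stand-in for the diffusion model: renders a synthetic image
    # conditioned on a layout (a list of labeled boxes).
    return Sample(image=f"synthetic:{len(layout)} objects", boxes=list(layout))

def refine_labels(sample):
    # Stand-in for label refinement: drop degenerate boxes (zero area),
    # mimicking the "missing/extra/misaligned" correction step.
    sample.boxes = [b for b in sample.boxes if b[3] > 0 and b[4] > 0]
    return sample

def augment(real_samples, n_synthetic):
    synthetic = []
    for i in range(n_synthetic):
        layout = real_samples[i % len(real_samples)].boxes  # reuse real layouts
        synthetic.append(refine_labels(generate_image(layout)))
    return real_samples + synthetic

real = [Sample("real_0", [(0, 10, 10, 5, 5), (1, 2, 2, 0, 3)])]
train_set = augment(real, 2)
print(len(train_set))  # 3
```

The zero-width box in the real layout survives in the real sample but is filtered out of every synthetic one, which is the shape of the refinement step the pith describes.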

Core claim

UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. It pairs this with a Focal Region Enhanced Data Pipeline (FRE-DP) that emphasizes object-concentrated foreground regions in synthesis, combined with a label refinement step to correct missing, extra and misaligned generations.

What carries the argument

The Visual Prototype Conditioned Diffusion Model (VPC-DM), which embeds class-representative object instances into latent space for generation, together with the Focal Region Enhanced Data Pipeline (FRE-DP) for foreground focus and label correction.
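One natural reading of "class-representative instances integrated into latent embeddings" is a mean embedding per class used as an extra conditioning token alongside each layout box. This is an illustrative sketch under that assumption, not the paper's implementation:

```python
# Build one visual prototype per class as the mean of instance embeddings,
# then form one conditioning token per layout box: prototype + box coords.
import numpy as np

def class_prototypes(instance_embs, labels):
    # instance_embs: (N, D) object crops embedded by some encoder; labels: (N,)
    protos = {}
    for c in np.unique(labels):
        protos[int(c)] = instance_embs[labels == c].mean(axis=0)
    return protos

def condition_tokens(layout, protos):
    # layout: list of (class_id, (x, y, w, h)) with normalized coordinates.
    return np.stack([np.concatenate([protos[c], np.asarray(box)])
                     for c, box in layout])

rng = np.random.default_rng(0)
embs = rng.normal(size=(6, 4))
labels = np.array([0, 0, 1, 1, 1, 0])
protos = class_prototypes(embs, labels)
tokens = condition_tokens([(0, (0.1, 0.2, 0.05, 0.05)),
                           (1, (0.5, 0.5, 0.02, 0.03))], protos)
print(tokens.shape)  # (2, 8)
```

Each token carries both appearance (the prototype) and geometry (the box), which is the property that plain text-conditioned layout-to-image models lack for tiny objects.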

Load-bearing premise

The synthetic images produced by prototype conditioning and focal-region refinement have a distribution close enough to real UAV photos that they do not introduce biases or artifacts harmful to downstream detector training.
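Distribution closeness of this kind is commonly probed with a Fréchet-style distance between feature statistics of real and synthetic images (FID). A minimal sketch, under the simplifying and non-standard assumption of diagonal covariances so no matrix square root is needed:

```python
# Fréchet distance between Gaussian fits of two feature sets, assuming
# diagonal covariances: ||mu_r - mu_f||^2 + sum(s_r + s_f - 2*sqrt(s_r*s_f)).
import numpy as np

def fid_diag(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s_r, s_f = feats_real.var(axis=0), feats_fake.var(axis=0)
    return float(((mu_r - mu_f) ** 2).sum()
                 + (s_r + s_f - 2 * np.sqrt(s_r * s_f)).sum())

rng = np.random.default_rng(1)
a = rng.normal(0, 1, size=(500, 8))   # "real" features
b = rng.normal(0, 1, size=(500, 8))   # same distribution -> small distance
c = rng.normal(2, 1, size=(500, 8))   # shifted distribution -> large distance
print(fid_diag(a, b) < fid_diag(a, c))  # True
```

A low score on such a metric would support the premise, but it would not rule out detector-harmful artifacts that feature statistics fail to capture, which is why the downstream-training test below remains the decisive one.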

What would settle it

Measure whether detectors trained on real UAV data plus UAVGen images achieve the reported accuracy gains over real data alone when evaluated on a large held-out set of genuine UAV images.

Figures

Figures reproduced from arXiv: 2604.02966 by Jiaxin Chen, Wenhao Li, Yu Wu, Zehua Fu, Zimeng Wu.

Figure 1. Illustration of different layout-to-image data generation.
Figure 2. Architecture of Visual Prototype Conditioned Focal Region Generation. (a) The Visual Prototype Conditioned Diffusion Model (VPC-DM) generates images guided by layout images produced from selected visual prototypes. (b) The Focal Region Enhanced Data Pipeline (FRE-DP) synthesizes images on object-centric areas to avoid the limitations of small-object generation. Moreover, Label Refinement mitigates the misalig…
Figure 3. Comparison of mAP across different categories on Vis…
Figure 4. Comparison of generated images on VisDrone. Our method exhibits superior layout-image consistency and enhanced visual…
Figure 5. Impact of various scales of generated images on object…
read the original abstract

Unmanned aerial vehicle (UAV) based object detection is a critical but challenging task, when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in promoting detection accuracy by synthesizing labeled images based on diffusion models. However, they suffer from frequently producing artifacts, especially near layout boundaries of tiny objects, thus substantially limiting their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. Specifically, UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors. The source code is available at https://github.com/Sirius-Li/UAVGen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes UAVGen, a layout-to-image generation framework for UAV-based object detection under limited annotated data. It introduces the Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs class-representative instances and integrates them into latent embeddings for high-fidelity synthesis, along with the Focal Region Enhanced Data Pipeline (FRE-DP) that emphasizes foreground regions and applies label refinement to correct missing, extra, or misaligned objects. The central claim is that extensive experiments show significant outperformance over state-of-the-art methods and consistent accuracy gains when the generated data is used to train distinct detectors.

Significance. If the reported gains hold under rigorous validation, the work could meaningfully advance synthetic data augmentation for UAV object detection by reducing artifacts near tiny objects and improving distribution match to real aerial imagery. The open availability of code is a clear strength that supports reproducibility and enables direct testing of the pipeline's effect on downstream detectors.

minor comments (2)
  1. Abstract: The claim of 'significantly outperforms state-of-the-art approaches' would be more informative if accompanied by at least one concrete metric (e.g., mAP improvement on a named UAV dataset) rather than remaining purely qualitative.
  2. Method description: The integration of visual prototypes into latent embeddings (VPC-DM) and the precise mechanism of focal-region emphasis plus label refinement (FRE-DP) would benefit from an explicit statement of how these steps are combined in the overall training objective or inference schedule.
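The second point can be made concrete with one plausible inference schedule (an assumption for illustration, not a schedule stated in the paper): a full-scene VPC-DM pass followed by zoomed-in re-synthesis of focal regions, sketched here with a stand-in generator:

```python
# Hypothetical composition of VPC-DM and FRE-DP at inference. Boxes are
# (class_id, x, y, w, h); the area threshold and 4x zoom are illustrative.
def focal_regions(layout, small=0.01):
    # FRE-DP intuition: flag small objects for a second, zoomed-in pass.
    return [b for b in layout if b[3] * b[4] < small]

def generate(boxes, scale=1.0, base=None):
    # Stand-in for the diffusion model; records (num_boxes, scale) per pass.
    ops = [] if base is None else base
    ops.append((len(boxes), scale))
    return ops

def synthesize(layout):
    image = generate(layout)                            # full-scene pass
    for box in focal_regions(layout):                   # focal second passes
        image = generate([box], scale=4.0, base=image)
    return image

layout = [(0, 0.1, 0.1, 0.2, 0.2), (1, 0.5, 0.5, 0.05, 0.05)]  # 2nd is tiny
print(synthesize(layout))  # [(2, 1.0), (1, 4.0)]
```

Whether the two stages share a training objective or only compose at inference is exactly the detail the report asks the authors to state.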

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. We are pleased that the contributions of UAVGen, including VPC-DM and FRE-DP, are viewed as potentially advancing synthetic data augmentation for UAV object detection. No specific major comments were raised in the report, so we interpret the minor revision request as an opportunity to polish presentation and add any clarifying details where helpful.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes a new layout-to-image generation framework (UAVGen) consisting of VPC-DM for visual prototype conditioning in diffusion models and FRE-DP for focal region emphasis with label refinement. The central performance claims rest on empirical experiments showing outperformance over SOTA and gains when integrated with detectors. No derivation chain, equations, or fitted parameters are presented that reduce the outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to justify core components. The method is self-contained as a technical proposal with falsifiable code, yielding no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard diffusion model assumptions for image synthesis and the premise that prototype conditioning plus focal enhancement will reduce boundary artifacts for tiny objects; no explicit free parameters or new physical entities are stated in the abstract.

axioms (1)
  • domain assumption: Diffusion models can produce high-fidelity object instances when conditioned on representative visual prototypes.
    Core premise of the VPC-DM component described in the abstract.

pith-pipeline@v0.9.0 · 5500 in / 1113 out tokens · 50340 ms · 2026-05-13T20:53:20.348804+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear

    Relation between the paper passage and the cited Recognition theorem:

    UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear

    Relation between the paper passage and the cited Recognition theorem:

    Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 4 internal anchors

  1. [1]

    Multidiffusion: Fusing diffusion paths for controlled image generation

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. In Proceedings of the International Conference on Machine Learning, pages 1737–1752, 2023.

  2. [2]

    Large scale GAN training for high fidelity natural image synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Proceedings of the International Conference on Learning Representations, pages 9256–9291, 2018.

  3. [3]

    Geodiffusion: Text-prompted geometric control for object detection data generation

    Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, and Dit-Yan Yeung. Geodiffusion: Text-prompted geometric control for object detection data generation. In Proceedings of the International Conference on Learning Representations, pages 846–868, 2024.

  4. [4]

    Layoutdiffuse: Adapting foundational diffusion models for layout-to-image generation

    Jiaxin Cheng, Xiao Liang, Xingjian Shi, Tong He, Tianjun Xiao, and Mu Li. Layoutdiffuse: Adapting foundational diffusion models for layout-to-image generation. arXiv preprint arXiv:2302.08908, 2023.

  5. [5]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pages 4643–4651, 2025.

  6. [6]

    Quantifying the simulation–reality gap for deep learning-based drone detection

    Tamara Regina Dieter, Andreas Weinmann, Stefan Jäger, and Eva Brucherseifer. Quantifying the simulation–reality gap for deep learning-based drone detection. Electronics, 12(10):2197, 2023.

  7. [7]

    Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images

    Bowei Du, Yecheng Huang, Jiaxin Chen, and Di Huang. Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13435–13444, 2023.

  8. [8]

    The unmanned aerial vehicle benchmark: Object detection and tracking

    Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision, pages 370–386, 2018.

  9. [9]

    Visdrone-det2019: The vision meets drone object detection in image challenge results

    Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, Qinghua Hu, Tao Peng, Jiayu Zheng, Xinyao Wang, Yue Zhang, et al. Visdrone-det2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 213–226, 2019.

  10. [10]

    Modeling visual context is key to augmenting object detection datasets

    Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling visual context is key to augmenting object detection datasets. In Proceedings of the European Conference on Computer Vision, pages 364–380, 2018.

  11. [11]

    Cut, paste and learn: Surprisingly easy synthesis for instance detection

    Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1301–1310, 2017.

  12. [12]

    Help from the sky: Leveraging UAVs for disaster management

    Milan Erdelj, Enrico Natalizio, Kaushik R Chowdhury, and Ian F Akyildiz. Help from the sky: Leveraging UAVs for disaster management. IEEE Pervasive Computing, 16(1):24–32, 2017.

  13. [13]

    The Pascal visual object classes (VOC) challenge

    Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

  14. [14]

    Magicdrive: Street view generation with diverse 3d geometry control

    Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. In Proceedings of the International Conference on Learning Representations, pages 904–923, 2024.

  15. [15]

    Simple copy-paste is a strong data augmentation method for instance segmentation

    Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2918–2928, 2021.

  16. [16]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

  17. [17]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.

  18. [18]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851, 2020.

  19. [19]

    Processing and assessment of spectrometric, stereoscopic imagery collected using a lightweight UAV spectral camera for precision agriculture

    Eija Honkavaara, Heikki Saari, Jere Kaivosoja, Ilkka Pölönen, Teemu Hakala, Paula Litkey, Jussi Mäkynen, and Liisa Pesonen. Processing and assessment of spectrometric, stereoscopic imagery collected using a lightweight UAV spectral camera for precision agriculture. Remote Sensing, 5(10):5006–5039, 2013.

  20. [20]

    Decentralized autonomous navigation of a UAV network for road traffic monitoring

    Hailong Huang, Andrey V Savkin, and Chao Huang. Decentralized autonomous navigation of a UAV network for road traffic monitoring. IEEE Transactions on Aerospace and Electronic Systems, 57(4):2558–2564, 2021.

  21. [21]

    Ufpmp-det: Toward accurate and efficient object detection on drone imagery

    Yecheng Huang, Jiaxin Chen, and Di Huang. Ufpmp-det: Toward accurate and efficient object detection on drone imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1026–1033, 2022.

  22. [22]

    High-resolution complex scene synthesis with transformers

    Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 7054–7065, 2021.

  23. [23]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.

  24. [24]

    YOLOv11: An overview of the key architectural enhancements

    Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.

  25. [25]

    Skyscenes: A synthetic dataset for aerial scene understanding

    Sahil Khose, Anisha Pal, Aayushi Agarwal, Deepanshi, Judy Hoffman, and Prithvijit Chattopadhyay. Skyscenes: A synthetic dataset for aerial scene understanding. In Proceedings of the European Conference on Computer Vision, pages 19–35, 2024.

  26. [26]

    Variational diffusion models

    Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems, pages 21696–21707, 2021.

  27. [27]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

  28. [28]

    Remdet: Rethinking efficient model design for UAV object detection

    Chen Li, Rui Zhao, Zeyu Wang, Huiying Xu, and Xinzhong Zhu. Remdet: Rethinking efficient model design for UAV object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4643–4651, 2025.

  29. [29]

    Trackdiffusion: Tracklet-conditioned video generation via diffusion models

    Pengxiang Li, Kai Chen, Zhili Liu, Ruiyuan Gao, Lanqing Hong, Dit-Yan Yeung, Huchuan Lu, and Xu Jia. Trackdiffusion: Tracklet-conditioned video generation via diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3539–3548.

  30. [30]

    Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection

    Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Advances in Neural Information Processing Systems, pages 21002–21012, 2020.

  31. [31]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.

  32. [32]

    Image synthesis from layout with locality-aware mask adaption

    Zejian Li, Jingyu Wu, Immanuel Koh, Yongchuan Tang, and Lingyun Sun. Image synthesis from layout with locality-aware mask adaption. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13819–13828, 2021.

  33. [33]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pages 740–755, 2014.

  34. [34]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, pages 8162–8171, 2021.

  35. [35]

    YOLOv3: An incremental improvement

    Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

  36. [36]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

  37. [37]

    Faster R-CNN: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.

  38. [38]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, pages 1278–1286.

  39. [39]

    Syndrone: multi-modal UAV dataset for urban scenarios

    Giulia Rizzoli, Francesco Barbato, Matteo Caligiuri, and Pietro Zanuttigh. Syndrone: multi-modal UAV dataset for urban scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2210–2220.

  40. [40]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  41. [41]

    Stereo vision three-dimensional terrain maps for precision agriculture

    Francisco Rovira-Más, Qin Zhang, and John F Reid. Stereo vision three-dimensional terrain maps for precision agriculture. Computers and Electronics in Agriculture, 60(2):133–143, 2008.

  42. [42]

    Progressive transformation learning for leveraging virtual images in training

    Yi-Ting Shen, Hyungtae Lee, Heesung Kwon, and Shuvra S Bhattacharyya. Progressive transformation learning for leveraging virtual images in training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 835–844, 2023.

  43. [43]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, pages 2256–2265, 2015.

  44. [44]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations, pages 14205–14224, 2021.

  45. [45]

    Improved techniques for training score-based generative models

    Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems, pages 12438–12448.

  46. [46]

    Aerogen: Enhancing remote sensing object detection with diffusion-driven data generation

    Datao Tang, Xiangyong Cao, Xuan Wu, Jialin Li, Jing Yao, Xueru Bai, Dongsheng Jiang, Yin Li, and Deyu Meng. Aerogen: Enhancing remote sensing object detection with diffusion-driven data generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3614–3624, 2025.

  47. [47]

    YOLOv12: Attention-centric real-time object detectors

    Yunjie Tian, Qixiang Ye, and David Doermann. YOLOv12: Attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524, 2025.

  48. [48]

    Satsynth: Augmenting image-mask pairs through diffusion models for aerial semantic segmentation

    Aysim Toker, Marvin Eisenberger, Daniel Cremers, and Laura Leal-Taixé. Satsynth: Augmenting image-mask pairs through diffusion models for aerial semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27695–27705, 2024.

  49. [49]

    NVAE: A deep hierarchical variational autoencoder

    Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In Advances in Neural Information Processing Systems, pages 19667–19679, 2020.

  50. [50]

    Instancediffusion: Instance-level control for image generation

    Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, and Ishan Misra. Instancediffusion: Instance-level control for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6232–6242, 2024.

  51. [51]

    Detdiffusion: Synergizing generative and perceptive models for enhanced data generation and perception

    Yibo Wang, Ruiyuan Gao, Kai Chen, Kaiqiang Zhou, Yingjie Cai, Lanqing Hong, Zhenguo Li, Lihui Jiang, Dit-Yan Yeung, Qiang Xu, et al. Detdiffusion: Synergizing generative and perceptive models for enhanced data generation and perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7246–7255, 2024.

  52. [52]

    Domain adaptive object detection for UAV-based images by robust representation learning and multiple pseudo-label aggregation

    Ke Wu, Jiaxin Chen, and Miao Wang. Domain adaptive object detection for UAV-based images by robust representation learning and multiple pseudo-label aggregation. In Proceedings of the ACM MM Workshops on Efficient Multimedia Computing under Limited, pages 59–67, 2024.

  53. [53]

    Datasetdm: Synthesizing data with perception annotations using diffusion models

    Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, and Chunhua Shen. Datasetdm: Synthesizing data with perception annotations using diffusion models. In Advances in Neural Information Processing Systems, pages 54683–54695, 2023.

  54. [54]

    Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion

    Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7452–7461, 2023.

  55. [55]

    Reco: Region-controlled text-to-image generation

    Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, et al. Reco: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14246–14255.

  56. [56]

    Synplay: Importing real-world diversity for a synthetic human dataset

    Jinsub Yim, Hyungtae Lee, Sungmin Eum, Yi-Ting Shen, Yan Zhang, Heesung Kwon, and Shuvra S Bhattacharyya. Synplay: Importing real-world diversity for a synthetic human dataset. arXiv e-prints, pages arXiv–2408, 2024.

  57. [57]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

  58. [58]

    Datasetgan: Efficient labeled data factory with minimal human effort

    Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio Torralba, and Sanja Fidler. Datasetgan: Efficient labeled data factory with minimal human effort. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10145–10155, 2021.

  59. [59]

    X-paste: Revisiting scalable copy-paste for instance segmentation using CLIP and StableDiffusion

    Hanqing Zhao, Dianmo Sheng, Jianmin Bao, Dongdong Chen, Dong Chen, Fang Wen, Lu Yuan, Ce Liu, Wenbo Zhou, Qi Chu, et al. X-paste: Revisiting scalable copy-paste for instance segmentation using CLIP and StableDiffusion. In Proceedings of the International Conference on Machine Learning, pages 42098–42109, 2023.

  60. [60]

    Layoutdiffusion: Controllable diffusion model for layout-to-image generation

    Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22490–22499, 2023.

  61. [61]

    Migc: Multi-instance generation controller for text-to-image synthesis

    Dewei Zhou, You Li, Fan Ma, Xiaoting Zhang, and Yi Yang. Migc: Multi-instance generation controller for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6818–6828, 2024.

  62. [62]

    Odgen: Domain-specific object detection data generation with diffusion models

    Jingyuan Zhu, Shiyu Li, Yuxuan Andy Liu, Jian Yuan, Ping Huang, Jiulong Shan, and Huimin Ma. Odgen: Domain-specific object detection data generation with diffusion models. In Advances in Neural Information Processing Systems, pages 63599–63633, 2024.

  63. [63]

    All experiments were conducted on 8 NVIDIA RTX 3080Ti GPUs

    and UAVDT [8]. All experiments were conducted on 8 NVIDIA RTX 3080Ti GPUs. For the UAV-based object detection model, following the default experimental setup of Remdet [28], we trained the detector on the VisDrone dataset for 300 epochs with a learning rate of 0.01, and applied data augmentation techniques such as mixup and Mosaic. The input image size...