
arxiv: 1906.11172 · v1 · pith:LONYYBLWnew · submitted 2019-06-26 · 💻 cs.CV · cs.LG

Learning Data Augmentation Strategies for Object Detection

Pith reviewed 2026-05-25 15:34 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords data augmentation · object detection · COCO dataset · policy learning · transfer learning · PASCAL-VOC · image classification

The pith

A learned data augmentation policy improves object detection accuracy by more than 2.3 mAP on COCO and transfers unchanged to other datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first shows that data augmentation operations taken from image classification give only modest help when training detection models. It then learns specialized augmentation policies through search, keeping the policy active only during training. On the COCO dataset these policies raise accuracy by more than 2.3 mAP and let one model reach 50.7 mAP. The same policy improves a strong baseline on PASCAL-VOC by 2.7 mAP without any changes. The learned policies also outperform architecture regularization methods for detection.

Core claim

Experiments on the COCO dataset indicate that an optimized data augmentation policy improves detection accuracy by more than +2.3 mAP, and allows a single inference model to achieve a state-of-the-art accuracy of 50.7 mAP. Importantly, the best policy found on COCO may be transferred unchanged to other detection datasets and models to improve predictive accuracy. For example, the best augmentation policy identified with COCO improves a strong baseline on PASCAL-VOC by +2.7 mAP. Our results also reveal that a learned augmentation policy is superior to state-of-the-art architecture regularization methods for object detection.

What carries the argument

A learned augmentation policy: a collection of image transformations and their parameters discovered by automated search to maximize detection performance on a given dataset.
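The Figure 2 caption describes each step of a sub-policy as a triplet of operation, application probability, and magnitude, with the bounding box adjusted to stay consistent with the image. A minimal sketch of that structure follows, under loud assumptions: the operation names `FlipX` and `TranslateX`, the toy list-of-rows image, and the box convention are all illustrative and are not taken from the paper or its released code.

```python
import random
from typing import List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), x_max exclusive


def flip_x(image: Image, boxes: List[Box], magnitude: float):
    """Mirror the image left-to-right and mirror each box with it."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - x1, y0, width - x0, y1) for (x0, y0, x1, y1) in boxes]
    return flipped, new_boxes


def translate_x(image: Image, boxes: List[Box], magnitude: float):
    """Shift pixels horizontally by `magnitude`, zero-filling; shift and clip boxes."""
    width = len(image[0])
    shift = int(magnitude)

    def shift_row(row: List[int]) -> List[int]:
        if shift >= 0:
            return ([0] * shift + row)[:width]
        return (row + [0] * (-shift))[-width:]

    new_boxes = []
    for x0, y0, x1, y1 in boxes:
        nx0 = min(max(x0 + shift, 0), width)
        nx1 = min(max(x1 + shift, 0), width)
        if nx1 > nx0:  # drop boxes pushed entirely out of the frame
            new_boxes.append((nx0, y0, nx1, y1))
    return [shift_row(row) for row in image], new_boxes


# Hypothetical operation registry; the paper's real operation set is not reproduced here.
OPERATIONS = {"FlipX": flip_x, "TranslateX": translate_x}

SubPolicy = List[Tuple[str, float, float]]  # (operation, probability, magnitude)


def apply_sub_policy(sub_policy: SubPolicy, image: Image, boxes: List[Box],
                     rng: random.Random):
    """Apply each step with its own probability, keeping boxes in sync with pixels."""
    boxes = list(boxes)
    for name, probability, magnitude in sub_policy:
        if rng.random() < probability:
            image, boxes = OPERATIONS[name](image, boxes, magnitude)
    return image, boxes
```

Each operation returns the transformed boxes together with the transformed image, so a step cannot move pixels without moving the labels that describe them.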

If this is right

  • A single inference model reaches 50.7 mAP on COCO.
  • The identical policy raises accuracy by 2.7 mAP on PASCAL-VOC without retraining or retuning.
  • Learned policies outperform architecture regularization methods on detection tasks.
  • Policies discovered on one detection dataset improve performance on other detection datasets and models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The transfer result suggests that the policy captures dataset-agnostic invariances that could apply to additional detection benchmarks.
  • Because the policy changes only training data, it can be combined with any existing detection architecture without altering inference.
  • The approach may lower the annotation cost for new detection tasks by extracting more signal from existing labeled images.
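The point that the policy touches only training data can be sketched as a preprocessing hook that is a no-op at evaluation time. This is an illustrative sketch, not the paper's input pipeline; `preprocess` and the `augment` callback signature are hypothetical names chosen for the example.

```python
import random
from typing import Callable, Optional


def preprocess(image, boxes, *, training: bool,
               augment: Optional[Callable] = None,
               rng: Optional[random.Random] = None):
    """Apply `augment` only during training; evaluation inputs pass through untouched."""
    if training and augment is not None:
        # `augment` is assumed to take (image, boxes, rng) and return (image, boxes).
        return augment(image, boxes, rng or random.Random())
    return image, boxes
```

Because the branch depends only on the `training` flag, the trained detector and its inference path need no changes when the augmentation is swapped or removed.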

Load-bearing premise

The reported accuracy gains are produced by the learned augmentation policy itself rather than by differences in training schedule, optimizer, or random seed.

What would settle it

Reproduce the COCO experiments with training schedule, optimizer, and random seed held fixed, varying only whether the learned policy or the baseline augmentations are used. If the policy is not the cause, the mAP gap should disappear.
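One way to make such a controlled comparison checkable is to diff the two run configurations and to average the per-seed gap. The sketch below assumes run configurations are plain dictionaries and that the augmentation setting lives under an `augmentation` key; both are hypothetical conventions for illustration, not the layout of the released code.

```python
_MISSING = object()  # sentinel so a key absent from one config counts as a difference


def differing_keys(baseline: dict, treatment: dict) -> list:
    """Return the sorted config keys whose values differ between two runs."""
    keys = set(baseline) | set(treatment)
    return sorted(
        key for key in keys
        if baseline.get(key, _MISSING) != treatment.get(key, _MISSING)
    )


def is_controlled_comparison(baseline: dict, treatment: dict,
                             varied_key: str = "augmentation") -> bool:
    """True only if the runs differ in exactly the one key under study."""
    return differing_keys(baseline, treatment) == [varied_key]


def mean_paired_gap(baseline_scores, treatment_scores) -> float:
    """Average per-seed score difference (treatment minus baseline)."""
    if not baseline_scores or len(baseline_scores) != len(treatment_scores):
        raise ValueError("need equal-length, non-empty score lists paired by seed")
    gaps = [t - b for b, t in zip(baseline_scores, treatment_scores)]
    return sum(gaps) / len(gaps)
```

Pairing scores by seed means each gap compares two runs that share initialization and data order, so seed-to-seed variance does not masquerade as a policy effect.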

Figures

Figures reproduced from arXiv: 1906.11172 by Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Jonathon Shlens, Quoc V. Le, Tsung-Yi Lin.

Figure 1: Learned augmentation policy systematically improves object detection performance. Left: Learned augmentation policy applied to example from COCO dataset [25]. Right: Mean average precision for RetinaNet [24] with a ResNet-50 backbone on COCO [25] with and without learned augmentation policy (red and black, respectively).
Figure 2: Examples of learned augmentation sub-policies. 5 examples of learned sub-policies applied to one example image. Each column corresponds to a different random sample of the corresponding sub-policy. Each step of an augmentation sub-policy consists of a triplet corresponding to the operation, the probability of application and a magnitude measure. The bounding box is adjusted to maintain consistency with the…
Figure 3: Percentage improvement in mAP for objects of …
Figure 4: Percentage improvement due to the learned aug…
Figure 5: Training loss vs. number of training examples for …
original abstract

Data augmentation is a critical component of training deep learning models. Although data augmentation has been shown to significantly improve image classification, its potential has not been thoroughly investigated for object detection. Given the additional cost for annotating images for object detection, data augmentation may be of even greater importance for this computer vision task. In this work, we study the impact of data augmentation on object detection. We first demonstrate that data augmentation operations borrowed from image classification may be helpful for training detection models, but the improvement is limited. Thus, we investigate how learned, specialized data augmentation policies improve generalization performance for detection models. Importantly, these augmentation policies only affect training and leave a trained model unchanged during evaluation. Experiments on the COCO dataset indicate that an optimized data augmentation policy improves detection accuracy by more than +2.3 mAP, and allows a single inference model to achieve a state-of-the-art accuracy of 50.7 mAP. Importantly, the best policy found on COCO may be transferred unchanged to other detection datasets and models to improve predictive accuracy. For example, the best augmentation policy identified with COCO improves a strong baseline on PASCAL-VOC by +2.7 mAP. Our results also reveal that a learned augmentation policy is superior to state-of-the-art architecture regularization methods for object detection, even when considering strong baselines. Code for training with the learned policy is available online at https://github.com/tensorflow/tpu/tree/master/models/official/detection

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper studies data augmentation for object detection, first showing that classification-derived operations give limited gains, then proposing to learn specialized augmentation policies via search. On COCO this yields >+2.3 mAP (reaching 50.7 mAP SOTA with one model); the best COCO policy transfers unchanged to improve a strong PASCAL-VOC baseline by +2.7 mAP and outperforms architecture regularization methods.

Significance. If the reported mAP gains are shown to be caused by the learned policy rather than by differences in training schedule, optimizer, or seeds, the work would be significant: it supplies the first demonstration of transferable, detection-specific augmentation policies and makes code available, strengthening the case that learned augmentation is a low-cost, high-leverage tool for detection.

major comments (2)
  1. [Abstract / Experimental results] The abstract and experimental description supply no explicit statement that the reported baseline and policy-augmented runs share identical optimizer, learning-rate schedule, number of epochs, data-loader settings, and random seeds. Without this control the +2.3 mAP delta cannot be isolated to the augmentation policy.
  2. [Methods / Policy search] No information is given on the search procedure itself (search space size, number of trials, validation-set usage during policy search, or whether the final reported numbers use the same validation split). This information is load-bearing for assessing whether the headline gains are statistically reliable.
minor comments (1)
  1. [Abstract] The GitHub link is provided but the manuscript does not state the exact commit or configuration files used to reproduce the 50.7 mAP result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of experimental clarity and reproducibility. We respond to each major comment below and will revise the manuscript to incorporate clarifications.

point-by-point responses
  1. Referee: [Abstract / Experimental results] The abstract and experimental description supply no explicit statement that the reported baseline and policy-augmented runs share identical optimizer, learning-rate schedule, number of epochs, data-loader settings, and random seeds. Without this control the +2.3 mAP delta cannot be isolated to the augmentation policy.

    Authors: We agree that an explicit statement would improve clarity. All experiments in the paper use identical training configurations (optimizer, learning-rate schedule, number of epochs, data-loader settings, and random seeds), with the only difference being the data augmentation policy. This is reflected in the released code at the provided GitHub link. We will add an explicit statement to the abstract and experimental results section confirming that all other factors are held constant across baseline and policy-augmented runs. revision: yes

  2. Referee: [Methods / Policy search] No information is given on the search procedure itself (search space size, number of trials, validation-set usage during policy search, or whether the final reported numbers use the same validation split). This information is load-bearing for assessing whether the headline gains are statistically reliable.

    Authors: We acknowledge that the methods section would benefit from expanded details on the search procedure. The policy search is conducted on a held-out validation split distinct from the test set used for final COCO and transfer results. We will revise the methods section to specify the search space size, number of trials performed, and confirmation that the validation split for search is separate from the evaluation test split. revision: yes
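The separation the rebuttal describes (policy search scored on one split, final numbers reported on another) can be sketched with a deliberately simple stand-in. The code below uses plain random search, which is a swapped-in technique for illustration and not the paper's search procedure; `OPS`, `sample_sub_policy`, and the `evaluate` callback are hypothetical.

```python
import random

# Illustrative operation names only; not the paper's search space.
OPS = ["TranslateX", "FlipX", "Equalize"]


def sample_sub_policy(rng: random.Random, num_steps: int = 2):
    """Draw a random sub-policy of (operation, probability, magnitude) triplets."""
    return [
        (rng.choice(OPS), round(rng.random(), 1), rng.randint(0, 10))
        for _ in range(num_steps)
    ]


def random_search(evaluate, num_trials: int, seed: int = 0):
    """Keep the candidate with the best score from `evaluate`.

    `evaluate` is assumed to train with the candidate and score it on the
    search validation split only; the test split is never consulted here.
    """
    rng = random.Random(seed)
    best_policy, best_score = None, float("-inf")
    for _ in range(num_trials):
        candidate = sample_sub_policy(rng)
        score = evaluate(candidate)
        if score > best_score:
            best_policy, best_score = candidate, score
    return best_policy, best_score
```

Reporting `num_trials` and the size of the sampled space alongside the final score is what lets a reader judge how much the selected policy may have been fitted to the search split.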

Circularity Check

0 steps flagged

No circularity: empirical gains validated on held-out COCO test set and transferred to independent VOC dataset

full rationale

The paper reports empirical improvements from searched augmentation policies, measured on held-out COCO test data (+2.3 mAP) and transferred unchanged to PASCAL-VOC (+2.7 mAP). No derivation chain, equations, or self-citations reduce any claimed result to a fitted parameter or input by construction. The central claims rest on standard train/test splits and cross-dataset transfer rather than self-definition or load-bearing self-citation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper introduces no new mathematical entities or axioms. It relies on the standard supervised-learning assumption that training and test distributions are similar enough for augmentation policies discovered on one to remain useful on the other, and on the empirical effectiveness of the (unspecified) policy-search algorithm.

free parameters (1)
  • augmentation policy
    The policy itself is the output of a search procedure run on COCO; its concrete parameters are therefore fitted to that dataset.
axioms (1)
  • domain assumption: Standard i.i.d. supervised learning assumptions hold for the COCO and PASCAL-VOC image distributions.
    All reported mAP numbers presuppose that the train and test splits are representative samples from the same underlying distribution.

pith-pipeline@v0.9.0 · 5814 in / 1427 out tokens · 41074 ms · 2026-05-25T15:34:36.524221+00:00 · methodology



Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 19 internal anchors

  [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operati...

  [2] A. Antoniou, A. Storkey, and H. Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.

  [3] H. S. Baird. Document image defect models. In Structured Document Image Analysis, pages 546–556. Springer, 1992.

  [4] D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3642–

  [5] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501,

  [6] E. D. Cubuk, B. Zoph, S. S. Schoenholz, and Q. V. Le. Intriguing properties of adversarial examples. arXiv preprint arXiv:1711.02846, 2017.

  [7] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.

  [8] T. DeVries and G. W. Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.

  [9] T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

  [10] D. Dwibedi, I. Misra, and M. Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1301–1310, 2017.

  [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.

  [12] N. Ford, J. Gilmer, N. Carlini, and D. Cubuk. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.

  [13] G. Ghiasi, T.-Y. Lin, and Q. V. Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10750–10760, 2018.

  [14] G. Ghiasi, T.-Y. Lin, R. Pang, and Q. V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

  [15] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron, 2018.

  [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

  [17] D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen. Population based augmentation: Efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393, 2019.

  [18] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.

  [19] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7310–7311, 2017.

  [20] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.

  [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

  [22] J. Lemley, S. Bazrafkan, and P. Corcoran. Smart augmentation learning an optimal data augmentation strategy. IEEE Access, 5:5858–5869, 2017.

  [23] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim. Fast autoaugment. arXiv preprint arXiv:1905.00397, 2019.

  [24] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.

  [25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer,

  [26] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.

  [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.

  [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

  [29] R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk. Improving robustness without sacrificing accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611, 2019.

  [30] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

  [31] S. Mun, S. Park, D. K. Han, and H. Ko. Generative adversarial network based acoustic scene training set augmentation and selection using svm hyper-plane. In Detection and Classification of Acoustic Scenes and Events Workshop, 2017.

  [32] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. Megdet: A large mini-batch object detector. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

  [33] L. Perez and J. Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.

  [34] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer,

  [35] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

  [36] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.

  [37] A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, pages 3239–3249, 2017.

  [38] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

  [39] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

  [40] I. Sato, H. Nishimura, and K. Yokoi. Apac: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229, 2015.

  [41] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  [42] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition, 2003.

  [43] L. Sixt, B. Wild, and T. Landgraf. Rendergan: Generating realistic labeled data. arXiv preprint arXiv:1611.01331, 2016.

  [44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

  [45] T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid. A bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems, pages 2794–2803, 2017.

  [46] V. Verma, A. Lamb, C. Beckham, A. Courville, I. Mitliagkas, and Y. Bengio. Manifold mixup: Encouraging meaningful on-manifold interpolation as a regularizer. arXiv preprint arXiv:1806.05236, 2018.

  [47] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.

  [48] X. Wang, A. Shrivastava, and A. Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2606–2615, 2017.

  [49] T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun. Metaanchor: Learning to detect objects with customized anchors. In Advances in Neural Information Processing Systems, pages 318–328, 2018.

  [50] D. Yin, R. G. Lopes, J. Shlens, E. D. Cubuk, and J. Gilmer. A fourier perspective on model robustness in computer vision. arXiv preprint arXiv:1906.08988,

  [51] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.

  [52] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

  [53] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.

  [54] X. Zhu, Y. Liu, Z. Qin, and J. Li. Data augmentation in emotion classification using generative adversarial networks. arXiv preprint arXiv:1711.00648, 2017.

  [55] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.

  [56] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

A. Appendix

Operation Name | Description | Range of magnitudes
ShearX(Y) | Shear the image and the corners of the bounding boxes along the horizontal (verti...