pith. machine review for the scientific record.

arxiv: 2604.11162 · v1 · submitted 2026-04-13 · 💻 cs.CV

Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks

Pith reviewed 2026-05-10 16:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords defect segmentation · SAM pseudo-masks · weak supervision · online self-correction · industrial inspection · box-to-pixel distillation · anomaly detection

The pith

Treating SAM as a noisy teacher lets a compact student learn accurate defect segmentation from bounding boxes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that pseudo-masks generated by SAM from bounding boxes remain useful for training defect segmentation models when treated as noisy supervision rather than ground truth. This would matter because dense pixel annotations are rarely available for industrial surfaces, while boxes are inexpensive to collect. The method freezes a feature backbone for stability, adds a decoder and separate binary head to handle localization, and applies one-sided online self-correction to override the teacher's background labels when the student detects defects confidently. If correct, weak box supervision could support reliable pixel-level inspection without full manual labeling.

Core claim

The paper claims that a noise-robust box-to-pixel distillation framework converts bounding boxes into SAM pseudo-masks offline and trains a compact student using a hierarchical decoder over frozen DINOv2 features, an auxiliary binary localization head, and a one-sided online self-correction that relaxes background supervision when the student is confident, targeting teacher false negatives. This produces higher anomaly mIoU and binary IoU than baselines trained identically on the same weak labels, plus higher binary recall, all with 80 percent fewer trainable parameters on a wind turbine inspection benchmark.

What carries the argument

The one-sided online self-correction mechanism that relaxes background supervision for confident student foreground predictions while treating SAM outputs as noisy labels.
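
A minimal sketch of how such a one-sided rule could enter a per-pixel loss. The function name `one_sided_bce`, the threshold `tau`, and the choice to drop relaxed pixels from the loss (rather than flip their label) are illustrative assumptions; the summary above does not specify these details.

```python
import math

def one_sided_bce(student_prob, teacher_mask, tau=0.9):
    """Mean binary cross-entropy with one-sided background relaxation.

    student_prob: per-pixel foreground probabilities from the student.
    teacher_mask: per-pixel teacher labels (1 = defect, 0 = background).
    tau: hypothetical confidence threshold (an assumption, not from the paper).
    Returns (mean loss over kept pixels, number of kept pixels).
    """
    eps = 1e-7
    total, kept = 0.0, 0
    for p, t in zip(student_prob, teacher_mask):
        # One-sided: only teacher *background* pixels can be relaxed, and
        # only where the student is confidently foreground.
        if t == 0 and p >= tau:
            continue  # no background penalty for this pixel
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
        kept += 1
    return total / max(kept, 1), kept

# Toy example: pixel 0 is labelled background by the teacher, but the
# student is 95% confident it is a defect.
probs = [0.95, 0.10, 0.80, 0.97]
teacher = [0, 0, 1, 1]
relaxed_loss, kept = one_sided_bce(probs, teacher, tau=0.9)
plain_loss, _ = one_sided_bce(probs, teacher, tau=2.0)  # tau > 1 disables relaxation
```

In this toy case the relaxed loss is lower because the student is no longer penalised for a confident foreground prediction on a pixel the teacher may have missed. Teacher foreground labels are never altered, which is what makes the rule one-sided.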

If this is right

  • The student achieves higher anomaly mIoU and binary IoU than the strongest baseline under identical weak supervision from bounding boxes.
  • Online self-correction specifically raises binary recall by addressing missed defects in the teacher masks.
  • The trained model requires 80 percent fewer trainable parameters than alternative approaches.
  • Performance stems from viewing SAM as a noisy source rather than using its outputs directly as supervision.
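
For readers who want the binary metrics made concrete, the sketch below computes binary IoU and recall from flattened 0/1 masks. The handling of empty masks and the toy masks themselves are assumptions for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
def binary_iou_and_recall(pred, target):
    """Binary IoU and recall for flattened 0/1 masks of equal length."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    union = tp + fp + fn
    # Assumed convention: an empty prediction on an empty target scores 1.0.
    iou = tp / union if union else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return iou, recall

# Toy masks (not data from the paper): the first prediction misses most of
# the defect; the second recovers more of it at the cost of one false positive.
target = [1, 1, 1, 0, 0, 0]
sparse_pred = [1, 0, 0, 0, 0, 0]
fuller_pred = [1, 1, 0, 1, 0, 0]

sparse_scores = binary_iou_and_recall(sparse_pred, target)
fuller_scores = binary_iou_and_recall(fuller_pred, target)
```

The second prediction raises recall from 1/3 to 2/3 while IoU rises only from 1/3 to 1/2, because the added false positive enlarges the union. This is why a recall gain alone does not settle whether precision was preserved.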

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-correction technique could transfer to distillation from other foundation models in segmentation tasks.
  • The framework may lower annotation costs for defect detection in additional industrial domains such as manufacturing lines.
  • Freezing the backbone features allows the approach to adapt to new surface types without retraining the entire network.

Load-bearing premise

The one-sided online self-correction reliably detects when to relax background supervision without introducing new false positives or harming generalization on unseen industrial surfaces.

What would settle it

Testing the full method on a new set of industrial surface images and finding that self-correction either lowers overall accuracy or adds false defect detections in background regions would show the mechanism fails to work as intended.

Figures

Figures reproduced from arXiv: 2604.11162 by Camile Lendering, Egor Bondarev, Erkut Akdag.

Figure 1: Overview of the proposed framework, Boxes2Pixels, for …
Figure 2: Overview of the hierarchical student architecture. The global semantic branch employs DINOv2 ViT-S/14 with BitFit adaptation …
Figure 3: Qualitative comparison on wind turbine blade inspection …

Original abstract

Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects. To address this limitation, a noise-robust box-to-pixel distillation framework, Boxes2Pixels, is proposed that treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives. On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases the binary recall by +18.56, while the model employs 80% fewer trainable parameters. Code is available at https://github.com/CLendering/Boxes2Pixels.
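
The abstract separates foreground discovery from class prediction and reports both a class-level and a binary metric. The sketch below shows one plausible reading of that split: a class-indexed mask is collapsed to a binary target, and a mean IoU is taken over the non-background classes. The label id `BACKGROUND = 0`, the function names, and the definition of anomaly mIoU as a mean over defect classes are assumptions, since the abstract does not define these terms.

```python
BACKGROUND = 0  # assumed label id for background pixels

def to_binary(mask):
    """Collapse a class-indexed mask to foreground (1) vs background (0)."""
    return [0 if c == BACKGROUND else 1 for c in mask]

def class_iou(pred, target, cls):
    """IoU for one class, or None if the class is absent from both masks."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else None

def anomaly_miou(pred, target, num_classes):
    """Mean IoU over non-background classes (one possible definition)."""
    ious = [class_iou(pred, target, c) for c in range(1, num_classes)]
    ious = [v for v in ious if v is not None]
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: every defect pixel is found, but one pixel gets the wrong class.
target = [0, 1, 1, 2, 2, 0]
pred = [0, 1, 2, 2, 2, 0]
miou = anomaly_miou(pred, target, num_classes=3)
same_foreground = to_binary(pred) == to_binary(target)
```

Here localization is perfect (the binary masks match) while the class-level score is 7/12, which illustrates why a separate binary head and a separate binary metric can be informative.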

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Boxes2Pixels, a noise-robust distillation framework for defect segmentation that converts bounding boxes to pseudo-masks via SAM and trains a compact student model using frozen DINOv2 features with a hierarchical decoder, an auxiliary binary localization head, and a one-sided online self-correction mechanism to relax background supervision on teacher false negatives. On a manually annotated wind turbine inspection benchmark, it reports gains of +6.97 anomaly mIoU and +9.71 binary IoU over the strongest baseline under identical weak supervision, with the self-correction adding +18.56 to binary recall; the model uses 80% fewer trainable parameters and code is released.

Significance. If the empirical gains hold under the reported conditions, the work provides a practical, parameter-efficient approach to leveraging noisy foundation-model pseudo-labels for industrial anomaly segmentation under weak supervision. The code release and focus on a real-world benchmark strengthen reproducibility and potential for deployment in inspection pipelines.

major comments (2)
  1. [Method (self-correction)] The one-sided online self-correction mechanism (described in the method) relies on the student's confidence threshold to relax background supervision. No details are provided on threshold selection, validation on a held-out set, or measured effects on precision/false-positive rate. This is load-bearing for the +18.56 recall claim, as miscalibration on unseen industrial textures risks introducing new false positives.
  2. [Experiments] The reported +6.97 anomaly mIoU and +9.71 binary IoU improvements are measured against the strongest baseline, but the paper does not explicitly confirm that all baselines use an identical decoder/head architecture, report the number of training runs with variance, or include an ablation isolating each component's contribution.
minor comments (2)
  1. [Abstract] Abstract: define or clarify the distinction between 'anomaly mIoU' and 'binary IoU'; specify the reference model size for the '80% fewer trainable parameters' statement.
  2. [Figures/Tables] Figure and table captions should explicitly state whether results are averaged over multiple seeds or single-run.
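
The threshold analysis requested in major comment 1 could be sketched as follows. For each candidate threshold, it counts the teacher-background pixels that would be relaxed and checks them against held-out ground truth. The function name `relaxation_report`, the two reported ratios, and all values in the toy example are hypothetical and serve only to illustrate the trade-off.

```python
def relaxation_report(student_prob, teacher_mask, truth, taus):
    """For each tau, return (tau, n_relaxed, precision, coverage).

    precision: share of relaxed pixels that are true defects.
    coverage: share of teacher false negatives that get relaxed.
    Both are None when their denominator is zero.
    """
    # Teacher false negatives: labelled background but truly defect.
    missed = sum(1 for t, g in zip(teacher_mask, truth) if t == 0 and g == 1)
    rows = []
    for tau in taus:
        relaxed = [i for i, (p, t) in enumerate(zip(student_prob, teacher_mask))
                   if t == 0 and p >= tau]
        hits = sum(truth[i] for i in relaxed)
        precision = hits / len(relaxed) if relaxed else None
        coverage = hits / missed if missed else None
        rows.append((tau, len(relaxed), precision, coverage))
    return rows

# Toy held-out pixels (made-up values, not data from the paper).
student = [0.95, 0.85, 0.60, 0.70, 0.20, 0.99]
teacher = [0, 0, 0, 0, 0, 1]
truth = [1, 1, 0, 0, 0, 1]
report = relaxation_report(student, teacher, truth, taus=[0.5, 0.9])
```

In this toy case a low threshold recovers every missed defect pixel but half of the relaxed pixels are true background, while a high threshold relaxes only correct pixels and recovers half of the misses. A table of this kind on a held-out split would address the false-positive concern directly.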

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. We address each major comment point-by-point below with clarifications and planned revisions.

  1. Referee: [Method (self-correction)] The one-sided online self-correction mechanism (described in the method) relies on the student's confidence threshold to relax background supervision. No details are provided on threshold selection, validation on a held-out set, or measured effects on precision/false-positive rate. This is load-bearing for the +18.56 recall claim, as miscalibration on unseen industrial textures risks introducing new false positives.

    Authors: We acknowledge that the original manuscript provided insufficient detail on the self-correction threshold. In the revised version we will expand the method section to describe the threshold selection procedure, including its determination on a small held-out subset of the training data (without touching the test set) and the observed trade-off between recall gains and precision. We will also report the measured impact on false-positive rate on the wind-turbine benchmark, confirming that the reported recall improvement does not come at the cost of a substantial increase in false positives. These additions will directly address the concern about miscalibration on industrial textures. revision: yes

  2. Referee: [Experiments] The reported +6.97 anomaly mIoU and +9.71 binary IoU improvements are measured against the strongest baseline, but the paper does not explicitly confirm that all baselines use an identical decoder/head architecture, report the number of training runs with variance, or include an ablation isolating each component's contribution.

    Authors: We confirm that every baseline was trained with the identical hierarchical decoder and auxiliary localization head as the proposed student model, ensuring the comparison occurs under identical weak-supervision conditions. The revised manuscript will explicitly state this architectural parity. We will additionally report results averaged over three independent training runs with standard deviation to quantify variance, and we will insert a new ablation table that isolates the contribution of the auxiliary head and the self-correction mechanism to the observed gains. revision: yes
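
The multi-run reporting promised in the second response amounts to a mean and sample standard deviation per metric. A minimal sketch using Python's standard library is given below; the helper names and the three scores are placeholders, not results from the paper.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores)

def format_result(name, scores):
    """Format a metric as 'name: mean ± std (n=runs)'."""
    m, s = summarize_runs(scores)
    return f"{name}: {m:.2f} ± {s:.2f} (n={len(scores)})"

# Placeholder scores for three seeds (made-up values for illustration).
placeholder_runs = [60.0, 62.0, 61.0]
line = format_result("example metric", placeholder_runs)
```

With only three runs the standard deviation is a rough estimate, so stating the number of seeds next to each figure, as the referee's minor comment 2 requests, matters for interpretation.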

Circularity Check

0 steps flagged

No significant circularity; central results are empirical performance metrics


The paper presents an empirical framework for training a compact student model on noisy SAM pseudo-masks generated from bounding boxes. Reported gains (+6.97 anomaly mIoU, +9.71 binary IoU, +18.56 recall) are measured on a held-out manually annotated wind-turbine benchmark under identical weak-supervision conditions. No equations, derivations, or first-principles results are given that reduce these quantities to fitted parameters, self-referential definitions, or self-citation chains. The one-sided self-correction mechanism is a training heuristic whose effect is validated externally by the benchmark numbers rather than being tautological with its own inputs. The work is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The approach rests on standard machine-learning assumptions about feature stability and noise patterns rather than new mathematical axioms or invented physical entities; a small number of training hyperparameters are implicit but not load-bearing for the conceptual claim.

free parameters (1)
  • self-correction confidence threshold
    Hyperparameter controlling when background supervision is relaxed; value not stated in abstract but required for the one-sided mechanism.
axioms (2)
  • domain assumption: Frozen DINOv2 features supply semantically stable representations suitable for industrial defect images
    Invoked to justify the hierarchical decoder design for stability.
  • domain assumption: SAM pseudo-masks exhibit systematic false-negative bias on sparse industrial defects that self-correction can mitigate
    Core premise enabling the one-sided relaxation strategy.


