Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks
Pith reviewed 2026-05-10 16:35 UTC · model grok-4.3
The pith
Treating SAM as a noisy teacher lets a compact student learn accurate defect segmentation from bounding boxes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a noise-robust box-to-pixel distillation framework converts bounding boxes into SAM pseudo-masks offline and trains a compact student using a hierarchical decoder over frozen DINOv2 features, an auxiliary binary localization head, and a one-sided online self-correction that relaxes background supervision when the student is confident, targeting teacher false negatives. This produces higher anomaly mIoU and binary IoU than baselines trained identically on the same weak labels, plus higher binary recall, all with 80 percent fewer trainable parameters on a wind turbine inspection benchmark.
What carries the argument
The one-sided online self-correction mechanism that relaxes background supervision for confident student foreground predictions while treating SAM outputs as noisy labels.
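One way to picture the mechanism is as a per-pixel loss with a one-sided ignore rule. The sketch below is illustrative only: the function name `one_sided_bce`, the threshold `tau`, and the choice to drop relaxed pixels from the loss entirely are assumptions, since this summary does not give the paper's exact formulation.

```python
import math

def one_sided_bce(student_probs, teacher_mask, tau=0.9, eps=1e-7):
    """Mean binary cross-entropy with one-sided relaxation.

    student_probs: foreground probabilities in [0, 1], one per pixel.
    teacher_mask:  noisy SAM pseudo-labels, 1 = defect, 0 = background.
    tau:           hypothetical confidence threshold (an assumption).
    """
    total, kept = 0.0, 0
    for p, y in zip(student_probs, teacher_mask):
        # Relax only teacher-background pixels that the student confidently
        # calls foreground (candidate teacher false negatives).
        if y == 0 and p >= tau:
            continue
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
        kept += 1
    return total / kept if kept else 0.0

# Toy example: the last pixel is teacher background but student-confident.
probs = [0.95, 0.10, 0.80, 0.97]
mask = [1, 0, 0, 0]
relaxed = one_sided_bce(probs, mask, tau=0.9)  # last pixel is ignored
plain = one_sided_bce(probs, mask, tau=1.1)    # threshold never reached
```

The rule is one-sided because teacher foreground pixels are never relaxed. Only background labels can be overridden, which matches the stated goal of targeting teacher false negatives.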
If this is right
- The student achieves higher anomaly mIoU and binary IoU than the strongest baseline under identical weak supervision from bounding boxes.
- Online self-correction specifically raises binary recall by addressing missed defects in the teacher masks.
- The trained model requires 80 percent fewer trainable parameters than alternative approaches.
- Performance stems from viewing SAM as a noisy source rather than using its outputs directly as supervision.
Where Pith is reading between the lines
- The self-correction technique could transfer to distillation from other foundation models in segmentation tasks.
- The framework may lower annotation costs for defect detection in additional industrial domains such as manufacturing lines.
- Freezing the backbone features allows the approach to adapt to new surface types without retraining the entire network.
Load-bearing premise
The one-sided online self-correction reliably detects when to relax background supervision without introducing new false positives or harming generalization on unseen industrial surfaces.
What would settle it
Evaluate the full method on a new set of industrial surface images. If self-correction lowers overall accuracy or adds false defect detections in background regions, the mechanism does not work as intended.
Original abstract
Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects. To address this limitation, a noise-robust box-to-pixel distillation framework, Boxes2Pixels, is proposed that treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives. On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases the binary recall by +18.56, while the model employs 80% fewer trainable parameters. Code is available at https://github.com/CLendering/Boxes2Pixels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Boxes2Pixels, a noise-robust distillation framework for defect segmentation that converts bounding boxes to pseudo-masks via SAM and trains a compact student model using frozen DINOv2 features with a hierarchical decoder, an auxiliary binary localization head, and a one-sided online self-correction mechanism to relax background supervision on teacher false negatives. On a manually annotated wind turbine inspection benchmark, it reports gains of +6.97 anomaly mIoU and +9.71 binary IoU over the strongest baseline under identical weak supervision, with the self-correction adding +18.56 to binary recall; the model uses 80% fewer trainable parameters and code is released.
Significance. If the empirical gains hold under the reported conditions, the work provides a practical, parameter-efficient approach to leveraging noisy foundation-model pseudo-labels for industrial anomaly segmentation under weak supervision. The code release and focus on a real-world benchmark strengthen reproducibility and potential for deployment in inspection pipelines.
major comments (2)
- [Method (self-correction)] The one-sided online self-correction mechanism (described in the method) relies on the student's confidence threshold to relax background supervision. No details are provided on threshold selection, validation on a held-out set, or measured effects on precision/false-positive rate. This is load-bearing for the +18.56 recall claim, as miscalibration on unseen industrial textures risks introducing new false positives.
- [Experiments] Experimental section: the reported +6.97 anomaly mIoU and +9.71 binary IoU improvements are given against the strongest baseline, but the paper does not confirm that all baselines use an identical decoder/head architecture, report the number of training runs with variance, or include an ablation isolating each component's contribution.
minor comments (2)
- [Abstract] Abstract: define or clarify the distinction between 'anomaly mIoU' and 'binary IoU'; specify the reference model size for the '80% fewer trainable parameters' statement.
- [Figures/Tables] Figure and table captions should explicitly state whether results are averaged over multiple seeds or single-run.
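The distinction the first minor comment asks for can be illustrated with a small sketch. The definitions used here are assumptions, not taken from the paper: "binary IoU" is read as the IoU of defect versus background after collapsing all defect classes, and "anomaly mIoU" as the mean IoU over defect classes with background excluded. The function names are hypothetical.

```python
def iou(pred, target, cls):
    """IoU of one class over flat label lists; None if absent from both."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else None

def binary_iou(pred, target, background=0):
    """Collapse every defect class to one foreground class, then take its IoU."""
    pb = [int(p != background) for p in pred]
    tb = [int(t != background) for t in target]
    return iou(pb, tb, 1)

def anomaly_miou(pred, target, num_classes, background=0):
    """Mean IoU over defect classes only, skipping classes absent from both maps."""
    scores = [iou(pred, target, c) for c in range(num_classes) if c != background]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None

# Toy label maps: class 0 is background, classes 1 and 2 are defect types.
pred   = [0, 1, 1, 2, 0, 2]
target = [0, 1, 2, 2, 0, 0]
b_iou = binary_iou(pred, target)       # 0.75
a_miou = anomaly_miou(pred, target, 3) # (1/2 + 1/3) / 2
```

Under these assumed definitions, a pixel labeled with the wrong defect class (index 2 above) lowers anomaly mIoU but not binary IoU, so the two numbers can move independently.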
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. We address each major comment point-by-point below with clarifications and planned revisions.
-
Referee: [Method (self-correction)] The one-sided online self-correction mechanism (described in the method) relies on the student's confidence threshold to relax background supervision. No details are provided on threshold selection, validation on a held-out set, or measured effects on precision/false-positive rate. This is load-bearing for the +18.56 recall claim, as miscalibration on unseen industrial textures risks introducing new false positives.
Authors: We acknowledge that the original manuscript provided insufficient detail on the self-correction threshold. In the revised version we will expand the method section to describe the threshold selection procedure, including its determination on a small held-out subset of the training data (without touching the test set) and the observed trade-off between recall gains and precision. We will also report the measured impact on false-positive rate on the wind-turbine benchmark, confirming that the reported recall improvement does not come at the cost of a substantial increase in false positives. These additions will directly address the concern about miscalibration on industrial textures. revision: yes
-
Referee: [Experiments] Experimental section: the reported +6.97 anomaly mIoU and +9.71 binary IoU improvements are given against the strongest baseline, but the paper does not confirm that all baselines use an identical decoder/head architecture, report the number of training runs with variance, or include an ablation isolating each component's contribution.
Authors: We confirm that every baseline was trained with the same hierarchical decoder and auxiliary localization head as the proposed student model, so the comparison is made under identical weak-supervision conditions. The revised manuscript will state this architectural parity explicitly. We will additionally report results averaged over three independent training runs with standard deviation to quantify variance, and we will add a new ablation table that isolates the contribution of the auxiliary head and the self-correction mechanism to the observed gains. revision: yes
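The threshold-selection procedure promised in the first response could look roughly like the sketch below. It assumes that a student has already been trained once per candidate threshold and that reference masks exist for the held-out subset. The selection rule (highest recall within a false-positive-rate budget), the names `confusion_rates` and `select_threshold`, and the toy numbers are all hypothetical, not the authors' procedure.

```python
def confusion_rates(pred, target):
    """Precision, recall and false-positive rate for flat 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr

def select_threshold(preds_by_tau, target, max_fpr=0.25):
    """Pick the tau with the best recall whose FPR stays within max_fpr."""
    best_tau, best_recall = None, -1.0
    for tau in sorted(preds_by_tau):
        _, recall, fpr = confusion_rates(preds_by_tau[tau], target)
        if fpr <= max_fpr and recall > best_recall:
            best_tau, best_recall = tau, recall
    return best_tau

# Toy held-out masks; each entry stands in for a student trained with that tau.
held_out = [1, 1, 1, 0, 0, 0, 0, 0]
preds_by_tau = {
    0.5: [1, 1, 1, 1, 1, 1, 0, 0],  # aggressive relaxation, many false positives
    0.7: [1, 1, 1, 1, 0, 0, 0, 0],
    0.9: [1, 1, 0, 0, 0, 0, 0, 0],  # conservative, misses a defect pixel
}
chosen = select_threshold(preds_by_tau, held_out, max_fpr=0.25)
```

Reporting the full (precision, recall, FPR) triple per threshold, rather than only the chosen value, would expose the trade-off the referee asks about.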
Circularity Check
No significant circularity; central results are empirical performance metrics
The paper presents an empirical framework for training a compact student model on noisy SAM pseudo-masks generated from bounding boxes. Reported gains (+6.97 anomaly mIoU, +9.71 binary IoU, +18.56 recall) are measured on a held-out manually annotated wind-turbine benchmark under identical weak-supervision conditions. No equations, derivations, or first-principles results are given that reduce these quantities to fitted parameters, self-referential definitions, or self-citation chains. The one-sided self-correction mechanism is a training heuristic whose effect is validated externally by the benchmark numbers rather than being tautological with its own inputs. The work is therefore self-contained against external evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- self-correction confidence threshold
axioms (2)
- domain assumption: Frozen DINOv2 features supply semantically stable representations suitable for industrial defect images
- domain assumption: SAM pseudo-masks exhibit systematic false-negative bias on sparse industrial defects that self-correction can mitigate