arxiv: 2604.11162 · v1 · submitted 2026-04-13 · 💻 cs.CV

Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks

Camile Lendering, Erkut Akdag, Egor Bondarev

Pith reviewed 2026-05-10 16:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords defect segmentation · SAM pseudo-masks · weak supervision · online self-correction · industrial inspection · box-to-pixel distillation · anomaly detection

The pith

Treating SAM as a noisy teacher lets a compact student learn accurate defect segmentation from bounding boxes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that pseudo-masks generated by SAM from bounding boxes remain useful for training defect segmentation models when treated as noisy supervision rather than ground truth. This would matter because dense pixel annotations are rarely available for industrial surfaces, while boxes are inexpensive to collect. The method freezes a feature backbone for stability, adds a decoder and separate binary head to handle localization, and applies one-sided online self-correction to override the teacher's background labels when the student detects defects confidently. If correct, weak box supervision could support reliable pixel-level inspection without full manual labeling.

Core claim

The paper claims that a noise-robust box-to-pixel distillation framework converts bounding boxes into SAM pseudo-masks offline and trains a compact student using a hierarchical decoder over frozen DINOv2 features, an auxiliary binary localization head, and a one-sided online self-correction that relaxes background supervision when the student is confident, targeting teacher false negatives. This produces higher anomaly mIoU and binary IoU than baselines trained identically on the same weak labels, plus higher binary recall, all with 80 percent fewer trainable parameters on a wind turbine inspection benchmark.

What carries the argument

The one-sided online self-correction mechanism that relaxes background supervision for confident student foreground predictions while treating SAM outputs as noisy labels.
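
A minimal sketch of how such a one-sided rule could look, assuming that "relaxing" background supervision means dropping those pixels from the loss. The function name `one_sided_corrected_bce`, the threshold `tau`, and its default value are hypothetical; the abstract does not give the paper's actual loss or threshold.

```python
import numpy as np


def one_sided_corrected_bce(student_prob, teacher_mask, tau=0.9, eps=1e-7):
    """Binary cross-entropy that ignores teacher-background pixels
    where the student is confidently foreground.

    student_prob : float array in [0, 1], per-pixel defect probability
    teacher_mask : {0, 1} array, SAM pseudo-mask (1 = defect)
    tau          : hypothetical confidence threshold (an assumption)
    """
    student_prob = np.clip(student_prob, eps, 1.0 - eps)
    teacher_fg = teacher_mask.astype(bool)

    # One-sided: only teacher *background* labels can be relaxed,
    # and only where the student predicts foreground with confidence.
    relaxed = (~teacher_fg) & (student_prob >= tau)
    keep = ~relaxed  # pixels that still receive teacher supervision

    target = teacher_fg.astype(np.float64)
    bce = -(target * np.log(student_prob)
            + (1.0 - target) * np.log(1.0 - student_prob))

    # Average only over supervised pixels.
    loss = (bce * keep).sum() / max(int(keep.sum()), 1)
    return loss, relaxed
```

Teacher foreground labels are never touched in this sketch, which is what makes the rule one-sided: it can only counteract teacher false negatives. In a real training loop the probabilities used to build the mask would normally be detached from the gradient, so the rule only changes which pixels are supervised.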

If this is right

The student achieves higher anomaly mIoU and binary IoU than the strongest baseline under identical weak supervision from bounding boxes.
Online self-correction specifically raises binary recall by addressing missed defects in the teacher masks.
The trained model requires 80 percent fewer trainable parameters than alternative approaches.
Performance stems from viewing SAM as a noisy source rather than using its outputs directly as supervision.
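
The parameter claim can be made concrete with a small counting sketch. It assumes bias-only tuning of the backbone (the Figure 2 caption mentions BitFit adaptation) with a fully trainable decoder and binary head. All parameter names and shapes below are hypothetical toy values, not the paper's architecture or its measured counts.

```python
from math import prod


def count_parameters(param_shapes, is_trainable):
    """Return (trainable, total) parameter counts for a name -> shape mapping."""
    total = sum(prod(shape) for shape in param_shapes.values())
    trainable = sum(prod(shape) for name, shape in param_shapes.items()
                    if is_trainable(name))
    return trainable, total


def bitfit_rule(name):
    """Backbone: train bias terms only. Everything else: train all parameters."""
    if name.startswith("backbone."):
        return name.endswith(".bias")
    return True


# Hypothetical toy model; shapes are illustrative only.
toy_shapes = {
    "backbone.block0.attn.qkv.weight": (1152, 384),
    "backbone.block0.attn.qkv.bias": (1152,),
    "decoder.conv.weight": (64, 384, 3, 3),
    "decoder.conv.bias": (64,),
    "binary_head.weight": (1, 64, 1, 1),
    "binary_head.bias": (1,),
}

trainable, total = count_parameters(toy_shapes, bitfit_rule)
reduction = 1.0 - trainable / total  # fraction of parameters left frozen
```

Note that `reduction` here is measured against the same model trained in full. The abstract does not say which reference model the 80 percent figure is measured against.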

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-correction technique could transfer to distillation from other foundation models in segmentation tasks.
The framework may lower annotation costs for defect detection in additional industrial domains such as manufacturing lines.
Freezing the backbone features allows the approach to adapt to new surface types without retraining the entire network.

Load-bearing premise

The one-sided online self-correction reliably detects when to relax background supervision without introducing new false positives or harming generalization on unseen industrial surfaces.

What would settle it

Testing the full method on a new set of industrial surface images and finding that self-correction either lowers overall accuracy or adds false defect detections in background regions would show the mechanism fails to work as intended.

Figures

Figures reproduced from arXiv: 2604.11162 by Camile Lendering, Egor Bondarev, Erkut Akdag.

**Figure 2.** Overview of the hierarchical student architecture. The global semantic branch employs DINOv2 ViT-S/14 with BitFit adaptation

**Figure 3.** Qualitative comparison on wind turbine blade inspection

Original abstract

Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects. To address this limitation, a noise-robust box-to-pixel distillation framework, Boxes2Pixels, is proposed that treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives. On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases the binary recall by +18.56, while the model employs 80% fewer trainable parameters. Code is available at https://github.com/CLendering/Boxes2Pixels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

This gives a practical student model that cleans up SAM's noisy box-to-mask outputs for industrial defects, with solid reported gains and code out, though the self-correction step rests on unproven calibration assumptions.

The core contribution here is a distillation setup that starts with SAM pseudo-masks from boxes, then trains a compact student on frozen DINOv2 features using a hierarchical decoder, an auxiliary binary localization head, and a one-sided online correction that drops background supervision only when the student is confident. This targets SAM's specific failure modes on sparse industrial defects rather than treating the pseudo-labels as clean supervision. The approach is not a generic weak-supervision baseline; the combination and the targeted correction are the new pieces.

On the wind-turbine benchmark the numbers show clear lifts: roughly +7 anomaly mIoU and +10 binary IoU over the strongest comparable baseline, plus an 18-point recall jump from the correction alone, all with 80% fewer trainable parameters. Code release helps. The empirical side looks reproducible on the stated data.

The soft spot is the self-correction itself. It assumes the student's confidence scores are well-calibrated on unseen industrial textures so that relaxing background labels does not create new false positives. The abstract gives no detail on threshold selection, held-out validation, or precision impact, so that link in the argument is the least secure. If the full paper shows ablation on the threshold and checks for over-correction on held-out surfaces, the concern shrinks; otherwise it stays a practical risk.

This is aimed at applied computer-vision groups doing defect inspection where bounding boxes are easy to get but dense labels are not. A reader working on similar industrial tasks would find the setup and metrics useful. I would send it to peer review. The results are concrete, the code is public, and the engineering choices are transparent enough that referees can check the details and ask for the missing calibration evidence.

Referee Report

2 major / 2 minor

Summary. The paper introduces Boxes2Pixels, a noise-robust distillation framework for defect segmentation that converts bounding boxes to pseudo-masks via SAM and trains a compact student model using frozen DINOv2 features with a hierarchical decoder, an auxiliary binary localization head, and a one-sided online self-correction mechanism to relax background supervision on teacher false negatives. On a manually annotated wind turbine inspection benchmark, it reports gains of +6.97 anomaly mIoU and +9.71 binary IoU over the strongest baseline under identical weak supervision, with the self-correction adding +18.56 to binary recall; the model uses 80% fewer trainable parameters and code is released.

Significance. If the empirical gains hold under the reported conditions, the work provides a practical, parameter-efficient approach to leveraging noisy foundation-model pseudo-labels for industrial anomaly segmentation under weak supervision. The code release and focus on a real-world benchmark strengthen reproducibility and potential for deployment in inspection pipelines.

major comments (2)

[Method (self-correction)] The one-sided online self-correction mechanism (described in the method) relies on the student's confidence threshold to relax background supervision. No details are provided on threshold selection, validation on a held-out set, or measured effects on precision/false-positive rate. This is load-bearing for the +18.56 recall claim, as miscalibration on unseen industrial textures risks introducing new false positives.
[Experiments] Experimental section: the reported +6.97 anomaly mIoU and +9.71 binary IoU improvements are given against the strongest baseline, but lack explicit confirmation that all baselines use identical decoder/head architecture, number of training runs with variance, or ablation isolating each component's contribution.

minor comments (2)

[Abstract] Abstract: define or clarify the distinction between 'anomaly mIoU' and 'binary IoU'; specify the reference model size for the '80% fewer trainable parameters' statement.
[Figures/Tables] Figure and table captions should explicitly state whether results are averaged over multiple seeds or single-run.
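
One plausible reading of the two metrics named in the first minor comment, sketched under an explicit assumption: "binary IoU" collapses every anomaly class into a single defect foreground, while "anomaly mIoU" averages per-class IoU over the anomaly classes only. The paper may define them differently. The function names and the convention that label 0 is background are hypothetical.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over union of two boolean masks (NaN if both are empty)."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return float("nan")
    return np.logical_and(pred, gt).sum() / union


def binary_iou(pred_labels, gt_labels, background=0):
    """IoU of defect vs. background, ignoring which anomaly class is predicted."""
    return iou(pred_labels != background, gt_labels != background)


def anomaly_miou(pred_labels, gt_labels, num_classes, background=0):
    """Mean IoU over anomaly classes, skipping background and absent classes."""
    scores = [iou(pred_labels == c, gt_labels == c)
              for c in range(num_classes) if c != background]
    return float(np.nanmean(scores))


def binary_recall(pred_labels, gt_labels, background=0):
    """Fraction of true defect pixels that are predicted as any defect class."""
    gt_fg = gt_labels != background
    if gt_fg.sum() == 0:
        return float("nan")
    return np.logical_and(pred_labels != background, gt_fg).sum() / gt_fg.sum()
```

Under this reading, a prediction that localizes every defect pixel but confuses two anomaly classes scores a perfect binary IoU and recall yet a lower anomaly mIoU, which is why the two numbers can move independently.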

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. We address each major comment point-by-point below with clarifications and planned revisions.

Referee: [Method (self-correction)] The one-sided online self-correction mechanism (described in the method) relies on the student's confidence threshold to relax background supervision. No details are provided on threshold selection, validation on a held-out set, or measured effects on precision/false-positive rate. This is load-bearing for the +18.56 recall claim, as miscalibration on unseen industrial textures risks introducing new false positives.

Authors: We acknowledge that the original manuscript provided insufficient detail on the self-correction threshold. In the revised version we will expand the method section to describe the threshold selection procedure, including its determination on a small held-out subset of the training data (without touching the test set) and the observed trade-off between recall gains and precision. We will also report the measured impact on false-positive rate on the wind-turbine benchmark, confirming that the reported recall improvement does not come at the cost of a substantial increase in false positives. These additions will directly address the concern about miscalibration on industrial textures.

revision: yes

Referee: [Experiments] Experimental section: the reported +6.97 anomaly mIoU and +9.71 binary IoU improvements are given against the strongest baseline, but lack explicit confirmation that all baselines use identical decoder/head architecture, number of training runs with variance, or ablation isolating each component's contribution.

Authors: We confirm that every baseline was trained with the identical hierarchical decoder and auxiliary localization head as the proposed student model, ensuring the comparison occurs under identical weak-supervision conditions. The revised manuscript will explicitly state this architectural parity. We will additionally report results averaged over three independent training runs with standard deviation to quantify variance, and we will insert a new ablation table that isolates the contribution of the auxiliary head and the self-correction mechanism to the observed gains.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; central results are empirical performance metrics

The paper presents an empirical framework for training a compact student model on noisy SAM pseudo-masks generated from bounding boxes. Reported gains (+6.97 anomaly mIoU, +9.71 binary IoU, +18.56 recall) are measured on a held-out manually annotated wind-turbine benchmark under identical weak-supervision conditions. No equations, derivations, or first-principles results are given that reduce these quantities to fitted parameters, self-referential definitions, or self-citation chains. The one-sided self-correction mechanism is a training heuristic whose effect is validated externally by the benchmark numbers rather than being tautological with its own inputs. The work is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The approach rests on standard machine-learning assumptions about feature stability and noise patterns rather than new mathematical axioms or invented physical entities; a small number of training hyperparameters are implicit but not load-bearing for the conceptual claim.

free parameters (1)

self-correction confidence threshold
Hyperparameter controlling when background supervision is relaxed; value not stated in abstract but required for the one-sided mechanism.
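
A hedged sketch of how this threshold could be examined, assuming a small held-out subset with manual masks is available (the abstract does not confirm this). For each candidate value it reports how many relaxed background pixels are true defects and what share of the teacher's missed defects they cover. The function name and the candidate thresholds are hypothetical.

```python
import numpy as np


def sweep_relaxation_threshold(student_prob, teacher_mask, true_mask, thresholds):
    """Return (tau, relaxation_precision, recovered_fraction) per threshold.

    relaxation_precision : share of relaxed pixels that are true defects
    recovered_fraction   : share of teacher false negatives that get relaxed
    """
    teacher_bg = ~teacher_mask.astype(bool)
    true_fg = true_mask.astype(bool)
    teacher_fn = teacher_bg & true_fg  # defects the teacher labeled background
    n_fn = int(teacher_fn.sum())

    rows = []
    for tau in thresholds:
        # Same one-sided rule as in training: only teacher background is eligible.
        relaxed = teacher_bg & (student_prob >= tau)
        n_relaxed = int(relaxed.sum())
        n_correct = int((relaxed & true_fg).sum())
        precision = n_correct / n_relaxed if n_relaxed else float("nan")
        recovered = n_correct / n_fn if n_fn else float("nan")
        rows.append((float(tau), precision, recovered))
    return rows
```

A low threshold recovers more missed defects but relaxes supervision on more true background, which is the over-correction risk raised in the referee report; a sweep like this makes that trade-off visible before touching the test set.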

axioms (2)

domain assumption: Frozen DINOv2 features supply semantically stable representations suitable for industrial defect images.
Invoked to justify the hierarchical decoder design for stability.
domain assumption: SAM pseudo-masks exhibit systematic false-negative bias on sparse industrial defects that self-correction can mitigate.
Core premise enabling the one-sided relaxation strategy.
