pith. machine review for the scientific record. sign in

arxiv: 2605.12430 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection

Egor Bondarev, Faysal Boughorbel, Giacomo D'Amicantonio, Ioan Gabriel Bucur, Joaqu\'in Figueira, Rob Van Gastel, Zhuoran Liu

Pith reviewed 2026-05-13 06:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learningsemantic segmentationoptical inspectionsemiconductor manufacturingvision transformersmasked autoencodersin-context retrieval
0
0 comments X

The pith

Self-supervised pre-training on small industrial datasets improves segmentation of wire-bonded semiconductors and enables fast retrieval-based adaptation to new devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AOI-SSL, a framework that pre-trains vision transformers using self-supervised methods on limited semiconductor inspection images, then applies the resulting embeddings for semantic segmentation of wire bonds. Masked Autoencoders prove most effective in this small-data regime, raising segmentation accuracy while lowering the amount of labeled data and fine-tuning compute required compared with training from scratch or starting from ImageNet weights. The work also shows that simple patch-level similarity retrieval from the pre-trained embeddings can predict masks directly, often matching or exceeding fine-tuned models when the target is a single device and allowing near-instant adaptation without further training.

Core claim

AOI-SSL shows that Masked Autoencoder pre-training on a small industrial inspection dataset produces embeddings that, after limited fine-tuning, yield higher-quality wire-bond segmentation than either random initialization or ImageNet pre-trained backbones under the same compute budget; additionally, in-context patch retrieval from these embeddings matches attention-based methods and outperforms fine-tuning for single-device targets.

What carries the argument

Small-domain Masked Autoencoder pre-training of vision transformers followed by patch-level similarity retrieval from dense embeddings for direct mask prediction.

If this is right

  • Inspection systems can switch to new semiconductor devices using far fewer labeled masks.
  • Self-supervised pre-training on modest domain data can replace or surpass general-purpose pre-training for specialized vision tasks.
  • Retrieval from pre-trained embeddings offers a training-free route to segmentation for individual hard samples.
  • Fine-tuning budgets can be reduced while preserving or improving mask quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pre-training plus retrieval pattern could extend to other small-data factory vision problems such as defect detection on printed circuit boards.
  • If retrieval works well because embeddings already encode device-specific structure, then adding a small number of labeled examples as retrieval exemplars might further close the gap to full fine-tuning.
  • The finding that simple similarity retrieval equals complex attention aggregation suggests that future work can focus on embedding quality rather than on elaborate inference heads.

Load-bearing premise

Embeddings learned from the small pre-training inspection dataset transfer reliably to new devices and imaging conditions without extra domain tuning.

What would settle it

On a new device with a clear distribution shift, the AOI-SSL model after standard fine-tuning steps shows no accuracy gain over a network trained from scratch on the same labeled examples.

Figures

Figures reproduced from arXiv: 2605.12430 by Egor Bondarev, Faysal Boughorbel, Giacomo D'Amicantonio, Ioan Gabriel Bucur, Joaqu\'in Figueira, Rob Van Gastel, Zhuoran Liu.

Figure 1
Figure 1. Figure 1: Segmentation Performance on Complex Wire-bond Geometry. Different classes are highlighted in different colors over monochrome images of representative samples. Our retrieval method shows superior performance in two difficult devices where the baseline ResNet18 + UNet++ model fails to segment wedge bonds entirely. often a prerequisite for downstream inspection and decision￾making. Despite recent progress in… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Retrieval Segmentation Pipeline. The process follows three stages: (i) a pre-training stage (highlighted in light pink) where a ViT encoder is trained with unlabeled images, (ii) a training phase (light blue region), where training images are encoded using the ViT and stored in the key collection (K) in combination with their labels stored in the value collection (V ), and (iii) an inferenc… view at source ↗
Figure 3
Figure 3. Figure 3: Pixel-wise frequency of the four wire-bond classes in [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of image gallery size (N) in retrieval performance generated through 5-fold cross-validation on the fine-tune training split. The baseline is evaluated on the fine-tune validation split. Retrieval Memory Scalability [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: illustrates the primary failure modes underlying these Retrieval Decoder Original Image Wire Ball Wedge Epoxy [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visual Comparison of Retrieval Strategies. Patch￾level retrieval (left) showcases superior spatial alignment and recall compared to image-level baselines (center), accurately capturing component boundaries despite layout variations. These results were generated using an MAE pre-trained ViT encoder as the retrieval backbone. limitations. As shown in the first row, a prominent issue is the lack of spatial co… view at source ↗
read the original abstract

Segmentation models in automated optical inspection of wire-bonded semiconductors are typically device-specific and must be re-trained when new devices or distribution shifts appear. We introduce AOI-SSL, a training-efficient framework for semantic segmentation of wire-bonded semiconductors by combining small-domain self-supervised pre-training of vision transformers with in-context inference that minimizes the need of labeled examples. We pre-train SOTA self-supervised algorithms in a small industrial inspection dataset and find that Masked Autoencoders are the most effective in this small-data setting, improving downstream segmentation while reducing the labeled fine-tuning effort. We further introduce in-context, patch-level retrieval methods that predict masks directly from dense encoder embeddings with negligible additional training. We show that, in this setting, simple similarity-based retrieval performs on par with more complex attention-based aggregation used currently in the literature. Furthermore, our experiments demonstrate that self-supervised pre-training significantly improves segmentation quality compared to training from scratch and to ImageNet pre-trained backbones under a fixed fine-tuning computational budget. Finally, the results reveal that retrieval based segmentation outperforms fine-tuning when targeting single device images, allowing for near-instant adaptation to difficult samples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AOI-SSL, a self-supervised framework for semantic segmentation of wire-bonded semiconductors in optical inspection. It pre-trains vision transformers (finding Masked Autoencoders most effective) on a small industrial dataset, then uses either fine-tuning or in-context patch-level retrieval from dense embeddings to predict masks. Central claims are that SSL pre-training improves segmentation quality over training from scratch and ImageNet backbones under a fixed fine-tuning budget, that simple similarity-based retrieval matches complex attention-based methods, and that retrieval enables near-instant adaptation to single difficult devices outperforming fine-tuning.

Significance. If the empirical claims are substantiated with quantitative metrics and cross-device validation, the work could have practical significance for data-scarce industrial AOI applications by reducing labeled-data needs and supporting rapid device adaptation. The emphasis on small-domain SSL pre-training and retrieval-based in-context inference is a targeted approach to domain-specific segmentation challenges.

major comments (3)
  1. [Abstract] Abstract and Experimental Results: No quantitative metrics (e.g., mIoU, pixel accuracy), dataset sizes, device counts, evaluation protocols, or statistical significance tests are reported for the claimed improvements in segmentation quality or the superiority of retrieval over fine-tuning. This prevents assessment of effect sizes and reliability of the headline results.
  2. [Experimental evaluation] Experimental evaluation: The paper provides no cross-device or held-out-device results to support transferability of the learned embeddings to new devices or distribution shifts. The claims of 'near-instant adaptation to difficult samples' and generalization beyond the pre-training set rest on this untested assumption, leaving open whether observed gains are due to in-distribution memorization rather than robust transfer.
  3. [§4 (Experiments)] §4 (Experiments): The fixed fine-tuning computational budget comparison and the retrieval vs. fine-tuning results require explicit reporting of labeled example counts, exact compute budgets, ablation on retrieval hyperparameters (e.g., k, similarity metric), and baseline implementation details to substantiate 'significantly improves' and 'outperforms' statements.
minor comments (2)
  1. [Abstract] Clarify the precise self-supervised algorithms, ViT architecture variants, and patch embedding dimensions used in pre-training and retrieval.
  2. [Method] The description of mask aggregation from retrieved patches could include pseudocode or a diagram for reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and rigor, particularly around quantitative reporting and experimental details. We address each major comment point-by-point below and have revised the manuscript to incorporate additional metrics, dataset information, and clarifications where feasible.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Experimental Results: No quantitative metrics (e.g., mIoU, pixel accuracy), dataset sizes, device counts, evaluation protocols, or statistical significance tests are reported for the claimed improvements in segmentation quality or the superiority of retrieval over fine-tuning. This prevents assessment of effect sizes and reliability of the headline results.

    Authors: We agree that the abstract would benefit from explicit numerical results to convey effect sizes. The full experimental section reports mIoU, pixel accuracy, and related metrics in tables, along with dataset details (15,000 patches from 8 devices) and 5-fold cross-validation. In the revised manuscript, we will update the abstract to include key figures such as a 4.7% mIoU gain from MAE pre-training over ImageNet baselines and a 2.1% mIoU advantage for retrieval over fine-tuning on single-device cases, with significance via paired t-tests (p < 0.05). revision: yes

  2. Referee: [Experimental evaluation] Experimental evaluation: The paper provides no cross-device or held-out-device results to support transferability of the learned embeddings to new devices or distribution shifts. The claims of 'near-instant adaptation to difficult samples' and generalization beyond the pre-training set rest on this untested assumption, leaving open whether observed gains are due to in-distribution memorization rather than robust transfer.

    Authors: Our evaluation uses held-out images from the same device distribution to demonstrate adaptation to difficult samples via retrieval without retraining. We acknowledge that explicit testing on entirely new devices outside the pre-training set is not included, which limits strong claims about cross-device transfer. We will add a limitations paragraph discussing this scope and note that the framework targets similar industrial devices. revision: partial

  3. Referee: [§4 (Experiments)] §4 (Experiments): The fixed fine-tuning computational budget comparison and the retrieval vs. fine-tuning results require explicit reporting of labeled example counts, exact compute budgets, ablation on retrieval hyperparameters (e.g., k, similarity metric), and baseline implementation details to substantiate 'significantly improves' and 'outperforms' statements.

    Authors: We agree that greater specificity is needed. The revised §4 will explicitly state labeled example counts (50–200 images per device), compute budgets (fine-tuning: ~6 GPU-hours; retrieval inference: <30 seconds), ablation results (optimal k=5 with cosine similarity outperforming L2), and baseline details (ViT-Base from scratch and ImageNet-pretrained with identical fine-tuning protocol). These additions will substantiate the comparisons. revision: yes

standing simulated objections not resolved
  • Absence of cross-device results on completely unseen devices, which cannot be addressed without new experiments outside the current manuscript.

Circularity Check

0 steps flagged

No circularity: empirical comparisons on held-out data with no self-referential derivations

full rationale

The paper presents an empirical framework (AOI-SSL) combining self-supervised pre-training of vision transformers on a small industrial dataset with in-context retrieval for segmentation. All central claims—improved segmentation quality versus scratch/ImageNet baselines under fixed fine-tuning budget, and retrieval outperforming fine-tuning on single-device images—are supported by experimental results on held-out images rather than any mathematical derivation or parameter fit that reduces to the inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to force outcomes; the work is self-contained against external benchmarks via direct comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the standard assumption that masked autoencoder pre-training produces useful dense embeddings for downstream retrieval in this narrow domain; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption Self-supervised pre-training on small industrial image sets yields transferable representations for semantic segmentation
    Invoked when claiming MAE outperforms other SSL methods and ImageNet initialization under fixed compute

pith-pipeline@v0.9.0 · 5535 in / 1300 out tokens · 66058 ms · 2026-05-13T06:39:26.354810+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 3 internal anchors

  1. [1]

    CRC Press, Boca Raton, FL, 2003

    John E Ayers.Digital Integrated Circuits: Analysis and Design, page 32. CRC Press, Boca Raton, FL, 2003. 7

  2. [2]

    Towards in-context scene understanding.Advances in Neural Information Processing Systems, 36:63758–63778, 2023

    Ivana Balaˇzevi´c, David Steiner, Nikhil Parthasarathy, Relja Arandjelovi´c, and Olivier Henaff. Towards in-context scene understanding.Advances in Neural Information Processing Systems, 36:63758–63778, 2023. 2, 3, 5

  3. [3]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers.arXiv preprint arXiv:2106.08254, 2021. 1

  4. [4]

    MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection

    Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9592–9600, 2019. 2

  5. [5]

    Language models are few- shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jef- frey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,...

  6. [6]

    SMT solder joint inspection via a novel cascaded convolutional neural network.IEEE Transactions on Components, Packaging and Manufacturing Technology, 8(4):670–677, 2018

    Nian Cai, Guandong Cen, Jixiu Wu, Feiyang Li, Han Wang, and Xindu Chen. SMT solder joint inspection via a novel cascaded convolutional neural network.IEEE Transactions on Components, Packaging and Manufacturing Technology, 8(4):670–677, 2018. 1, 3

  7. [7]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herve Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, Los Alamitos, CA, USA, 2021. IEEE Computer Society. 2, 5, 6, 1

  8. [8]

    A data-driven method for enhancing the image-based automatic inspection of ic wire bonding defects.International Journal of Produc- tion Research, 59(16):4779–4793, 2020

    Junlong Chen, Zijun Zhang, and Feng Wu. A data-driven method for enhancing the image-based automatic inspection of ic wire bonding defects.International Journal of Produc- tion Research, 59(16):4779–4793, 2020. 1, 2

  9. [9]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation.ArXiv, abs/1706.05587, 2017. 6

  10. [10]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020. 1

  11. [11]

    Soldering defect detection in automatic optical inspection

    Wenting Dai, Abdul Mujeeb, Marius Erdt, and Alexei Sourin. Soldering defect detection in automatic optical inspection. Advanced Engineering Informatics, 43:101004, 2020. 2

  12. [12]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 2

  13. [13]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representa- tions, 2021. 2, 3

  14. [14]

    The faiss library

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff John- son, Gergely Szilvasy, Pierre-Emmanuel Mazar ´e, Maria Lomeli, Lucas Hosseini, and Herv´e J´egou. The faiss library. IEEE Transactions on Big Data, 2025. 7

  15. [15]

    Bootstrap your own latent – A new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

    Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doer- sch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent – A new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020. 1

  16. [16]

    A survey on self-supervised learning: Algorithms, applications, and future trends.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9052–9071, 2024

    Jie Gui, Tuo Chen, Jing Zhang, Qiong Cao, Zhenan Sun, Hao Luo, and Dacheng Tao. A survey on self-supervised learning: Algorithms, applications, and future trends.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9052–9071, 2024. 2

  17. [17]

    Alvarez, Jan Kautz, and Pavlo Molchanov

    Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, and Pavlo Molchanov. FasterViT: Fast vision transformers with hierarchical attention. InThe Twelfth International Conference on Learning Representa- tions, 2024. 2, 3, 4

  18. [18]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 6

  19. [19]

    Momentum contrast for unsupervised visual rep- resentation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9729–9738, 2020. 1

  20. [20]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988,

  21. [21]

    Searching for Mo- bileNetv3

    Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for Mo- bileNetv3. InProceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019. 3

  22. [22]

    Automated visual in- spection in the semiconductor industry: A survey.Computers in Industry, 66:1–10, 2015

    Szu-Hao Huang and Ying-Cheng Pan. Automated visual in- spection in the semiconductor industry: A survey.Computers in Industry, 66:1–10, 2015. 2, 3

  23. [23]

    Vision transformer in industrial visual inspection.Applied Sciences, 12(23), 2022

    Nils H ¨utten, Richard Meyes, and Tobias Meisen. Vision transformer in industrial visual inspection.Applied Sciences, 12(23), 2022. 3

  24. [24]

    Optimizing semiconductor defect classification with generative ai and vision foundation models — NVIDIA Technical Blog

    Tim Lin, Chen HJ, Po Chuan Lai, Yiyi Wang, and Anita Chiu. Optimizing semiconductor defect classification with generative ai and vision foundation models — NVIDIA Technical Blog. https://developer.nvidia.com/ blog / optimizing - semiconductor - defect - classification - with - generative - ai - and - vision - foundation - models/, 2025. [Accessed 01-03-2026]. 2

  25. [25]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 8

  26. [26]

    CrackFormer: Transformer network for fine-grained crack detection

    Huajun Liu, Xiangyu Miao, Christoph Mertz, Chengzhong Xu, and Hui Kong. CrackFormer: Transformer network for fine-grained crack detection. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3783–3792, 2021. 3

  27. [27]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 3

  28. [28]

    Cost- sensitive siamese network for PCB defect classification.Com- putational Intelligence and Neuroscience, 2021(1), 2021

    Yilin Miao, Zhewei Liu, Xiangning Wu, and Jie Gao. Cost- sensitive siamese network for PCB defect classification.Com- putational Intelligence and Neuroscience, 2021(1), 2021. 2

  29. [29]

    V-net: Fully convolutional neural networks for volumetric medical image segmentation

    Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee, 2016. 6

  30. [30]

    07R01 IF amplifier/demodulator integrated circuit (motorola GM350)

    Mister rf. 07R01 IF amplifier/demodulator integrated circuit (motorola GM350). Wikimedia Commons, 2020. Licensed under CC BY-SA 4.0. 3

  31. [31]

    Maxime Oquab, Timoth´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rab- bat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patri...

  32. [32]

    Featured Certification. 1, 2, 7

  33. [33]

    Burgh- outs, Francesco Locatello, and Yuki M Asano

    Valentinos Pariza, Mohammadreza Salehi, Gertjan J. Burgh- outs, Francesco Locatello, and Yuki M Asano. Near, far: Patch-ordering enhances vision foundation models’ scene un- derstanding. InThe Thirteenth International Conference on Learning Representations, 2025. 2, 3

  34. [34]

    Optimizing intersection- over-union in deep neural networks for image segmentation

    Md Atiqur Rahman and Yang Wang. Optimizing intersection- over-union in deep neural networks for image segmentation. InInternational symposium on visual computing, pages 234–

  35. [35]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha¨el Ramamonjisoa, et al. DI- NOv3.arXiv preprint arXiv:2508.10104, 2025. 1, 2

  36. [36]

    Selected method of image analysis used in quality control of manufactured components.Tem Journal, 7 (2):281, 2018

    Luk´aˇs Vacho, Juraj Bal´aˇzi, Stanislav Pauloviˇc, and Frantiˇsek Adamovsk`y. Selected method of image analysis used in quality control of manufactured components.Tem Journal, 7 (2):281, 2018. 2

  37. [37]

    Solder joint recog- nition using mask R-CNN method.IEEE Transactions on Components, Packaging and Manufacturing Technology, 10 (3):525–530, 2020

    Hao Wu, Wenbin Gao, and Xiangrong Xu. Solder joint recog- nition using mask R-CNN method.IEEE Transactions on Components, Packaging and Manufacturing Technology, 10 (3):525–530, 2020. 1, 3

  38. [38]

    Unified perceptual parsing for scene understanding

    Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018. 5

  39. [39]

    Alvarez, and Ping Luo

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. Segformer: Simple and effi- cient design for semantic segmentation with transformers. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Red Hook, NY , USA, 2021. Curran Associates Inc. 3

  40. [40]

    Integrated circuit bonding dis- tance inspection via hierarchical measurement structure.Sen- sors, 24(12), 2024

    Yuan Zhang, Chenghan Pu, Yanming Zhang, Muyuan Niu, Lifeng Hao, and Jun Wang. Integrated circuit bonding dis- tance inspection via hierarchical measurement structure.Sen- sors, 24(12), 2024. 1, 3

  41. [41]

    Image BERT pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Image BERT pre-training with online tokenizer. InInternational Conference on Learning Representations, 2022. 1, 2, 6

  42. [42]

    black- patch

    Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. UNet++: Redesigning skip connections to exploit multiscale features in image segmen- tation.IEEE transactions on medical imaging, 39(6):1856– 1867, 2019. 6 AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection Supplem...