pith. machine review for the scientific record. sign in

arxiv: 2604.02905 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-set defect recognitionvisual promptingcontrastive learningindustrial inspectionanomaly detectionangular manifoldspatial-spectral encodingretraining-free learning
0
0 comments X

The pith

UniSpector structures visual prompts into a semantically organized angular manifold to detect novel industrial defects without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that visual prompting can scale to open-set defect recognition if the prompt space is deliberately organized rather than matched naively to image regions. Existing methods collapse under high intra-class variance and subtle inter-class differences in defect images. UniSpector counters this with a Spatial-Spectral Prompt Encoder that produces orientation-invariant fine-grained features and a Contrastive Prompt Encoder that regularizes those features into an angular manifold. Prompt-guided Query Selection then aligns object queries to the structured prompts. On the new Inspect Anything benchmark the approach raises AP50b and AP50m by at least 19.7 and 15.8 points over baselines while remaining retraining-free.

Core claim

UniSpector shifts visual prompting from direct region matching to the design of a transferable prompt topology. The Spatial-Spectral Prompt Encoder extracts orientation-invariant representations; the Contrastive Prompt Encoder explicitly arranges these representations on a semantically organized angular manifold; and Prompt-guided Query Selection produces adaptive queries aligned with that manifold. The resulting system performs open-set defect localization on the Inspect Anything benchmark at substantially higher AP50b and AP50m than prior prompting baselines.

What carries the argument

The Spatial-Spectral Prompt Encoder paired with the Contrastive Prompt Encoder, which together prevent embedding collapse and enforce a semantically organized angular manifold for prompts.

If this is right

  • Industrial inspection systems can add new defect classes by updating prompts alone rather than retraining entire models.
  • Localization accuracy improves by at least 19 percent on open-set benchmarks without sacrificing closed-set performance.
  • The same prompt topology design can be reused across multiple inspection sites or product lines.
  • Prompt-based pipelines become viable for continuously evolving manufacturing environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The angular-manifold construction may transfer to other prompt-based open-set tasks such as medical imaging or remote sensing if similar variance issues appear.
  • Explicit contrastive regularization of prompt space could become a standard module in future visual-prompting architectures beyond defect detection.
  • Real-time factory deployment would require measuring whether the manifold remains stable under lighting changes or camera drift not present in the benchmark.

Load-bearing premise

That spatial-spectral encoding plus contrastive regularization can still produce distinct angular clusters when defect images contain high intra-class variation and only subtle inter-class differences.

What would settle it

A controlled test set of defect images engineered with greater intra-class variance and finer inter-class distinctions than those in Inspect Anything, on which prompt embeddings are measured for collapse or loss of semantic separation.

Figures

Figures reproduced from arXiv: 2604.02905 by Geonuk Kim, Hyeonseong Jeon, Hyoungjoon Lim, Jeonghoon Han, Junho Yim, Kangil Lee, Minhoi Kim, Minsu Kim.

Figure 1
Figure 1. Figure 1: Comparison of visual inspection paradigms: (a) closed [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Examples from the InsA benchmark. Top: Samples from the same defect class showing high intra-class appearance vari￾ance. Bottom: Samples from different classes exhibiting similar visual patterns, resulting in low inter-class separability. Such am￾biguities highlight the inherent difficulty of defect recognition. 2. Related Works 2.1. Traditional defect detection and segmentation Modern industrial defect in… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of UniSpector, an open-set defect detection and segmentation framework. The Spatial–Spectral Prompt Encoder ex￾tracts orientation-invariant spectral cues fused with spatial features to distinguish visually similar defects. Building on these, Contrastive Prompt Encoding regularizes the prompt embedding space into a structured manifold for robust open-set generalization. A Prompt-guided Query Select… view at source ↗
Figure 4
Figure 4. Figure 4: 3D PCA projection of L2-normalized prompt embeddings learned by [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Intra-class cosine similarity comparison across seen and [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Effect of the number of prompt samples per defect class. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of Prompt-to-Target Ratios across Defect Classes. It illustrates the prompt-to-target ratio for each defect class (with [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: UniSpector is capable of recognizing unseen defects via visual prompts. (a) Orange box: user-specified prompt region. (b) Blue box: corresponding ground-truth in the target image. (c) Green boxes: correct predictions by UniSpector, accurately localizing subtle defects. (d) Red boxes: DINOv predictions, showing failure to localize the prompted defect [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Detailed view of the inference phase. Unlike the training [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Robustness against prompt annotation (averaged over [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
read the original abstract

Although industrial inspection systems should be capable of recognizing unprecedented defects, most existing approaches operate under a closed-set assumption, which prevents them from detecting novel anomalies. While visual prompting offers a scalable alternative for industrial inspection, existing methods often suffer from prompt embedding collapse due to high intra-class variance and subtle inter-class differences. To resolve this, we propose UniSpector, which shifts the focus from naive prompt-to-region matching to the principled design of a semantically structured and transferable prompt topology. UniSpector employs the Spatial-Spectral Prompt Encoder to extract orientation-invariant, fine-grained representations; these serve as a solid basis for the Contrastive Prompt Encoder to explicitly regularize the prompt space into a semantically organized angular manifold. Additionally, Prompt-guided Query Selection generates adaptive object queries aligned with the prompt. We introduce Inspect Anything, the first benchmark for visual-prompt-based open-set defect localization, where UniSpector significantly outperforms baselines by at least 19.7% and 15.8% in AP50b and AP50m, respectively. These results show that our method enable a scalable, retraining-free inspection paradigm for continuously evolving industrial environments, while offering critical insights into the design of generic visual prompting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes UniSpector for universal open-set defect recognition in industrial inspection via visual prompting. It introduces a Spatial-Spectral Prompt Encoder to extract orientation-invariant fine-grained features and a Contrastive Prompt Encoder to regularize the prompt space into a semantically organized angular manifold that resists collapse under high intra-class variance. A Prompt-guided Query Selection module generates adaptive queries, and the method is evaluated on a newly introduced Inspect Anything benchmark for visual-prompt-based open-set defect localization, where it reports gains of at least 19.7% AP50b and 15.8% AP50m over baselines, supporting a retraining-free inspection paradigm.

Significance. If the central claims hold, the work could meaningfully advance scalable, open-set industrial inspection by addressing prompt collapse in visual prompting and enabling detection of novel defects without retraining. The introduction of the Inspect Anything benchmark is a constructive contribution that could facilitate future research in prompt-based open-set localization.

major comments (2)
  1. [Method (Contrastive Prompt Encoder) and Experiments] The manuscript attributes the reported performance gains to the Contrastive Prompt Encoder creating a semantically organized angular manifold that prevents embedding collapse, yet provides no quantitative verification of this property (e.g., intra-class vs. inter-class cosine similarity statistics, angular separation metrics, or embedding visualizations) in the prompt space. This is load-bearing for the central claim, as the abstract and method description leave open the possibility that gains arise instead from the Prompt-guided Query Selection or benchmark-specific factors.
  2. [Experiments and Abstract] The abstract states significant outperformance on the Inspect Anything benchmark but the manuscript supplies insufficient experimental details on baseline implementations, ablation studies isolating each module's contribution, or controls for prompt collapse. Without these, the data-to-claim connection for the 19.7% AP50b and 15.8% AP50m improvements cannot be assessed.
minor comments (2)
  1. [Abstract] Define AP50b and AP50m explicitly (e.g., average precision at IoU threshold 0.5 for bounding boxes and masks) at first use.
  2. [Method] Clarify the exact loss formulation and temperature parameters in the contrastive regularization to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript would benefit from additional quantitative verification of the prompt space properties and expanded experimental details. We will revise the paper accordingly to strengthen the evidence for our claims.

read point-by-point responses
  1. Referee: [Method (Contrastive Prompt Encoder) and Experiments] The manuscript attributes the reported performance gains to the Contrastive Prompt Encoder creating a semantically organized angular manifold that prevents embedding collapse, yet provides no quantitative verification of this property (e.g., intra-class vs. inter-class cosine similarity statistics, angular separation metrics, or embedding visualizations) in the prompt space. This is load-bearing for the central claim, as the abstract and method description leave open the possibility that gains arise instead from the Prompt-guided Query Selection or benchmark-specific factors.

    Authors: We acknowledge that the manuscript currently relies on the architectural description and overall performance gains without providing direct quantitative metrics on the prompt embeddings. In the revised version, we will add intra-class versus inter-class cosine similarity statistics, angular separation metrics (such as mean angular distances), and embedding visualizations (e.g., t-SNE or PCA plots) of the prompt space both with and without the Contrastive Prompt Encoder. These additions will explicitly demonstrate the formation of the semantically organized angular manifold and help rule out alternative explanations for the gains. revision: yes

  2. Referee: [Experiments and Abstract] The abstract states significant outperformance on the Inspect Anything benchmark but the manuscript supplies insufficient experimental details on baseline implementations, ablation studies isolating each module's contribution, or controls for prompt collapse. Without these, the data-to-claim connection for the 19.7% AP50b and 15.8% AP50m improvements cannot be assessed.

    Authors: We agree that the experimental section needs to be expanded for full reproducibility and to isolate contributions. In the revision, we will include complete implementation details for all baselines (including any modifications for the open-set setting), full ablation tables breaking down the impact of the Spatial-Spectral Prompt Encoder, Contrastive Prompt Encoder, and Prompt-guided Query Selection individually, and targeted controls for prompt collapse (e.g., variants with and without contrastive regularization, along with collapse metrics such as embedding variance). These will directly link the reported improvements to the proposed components. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluation of novel architecture

full rationale

The paper introduces Spatial-Spectral Prompt Encoder and Contrastive Prompt Encoder as new components to organize prompt embeddings into an angular manifold, then reports performance gains on the newly introduced Inspect Anything benchmark. No equations, derivations, or self-citations are shown that reduce the claimed AP50 improvements to quantities defined by fitted parameters, self-referential normalizations, or prior author work. The derivation chain is self-contained: the method is proposed, implemented, and measured against baselines without any step that renames a fit as a prediction or imports uniqueness via self-citation. This matches the expected non-finding for papers whose central contribution is architectural and benchmark-driven.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the approach rests on standard computer-vision assumptions about prompt embeddings and contrastive regularization; no explicit free parameters, new physical entities, or ad-hoc axioms are stated.

axioms (1)
  • domain assumption Visual prompting can serve as a scalable alternative to closed-set training for industrial defect recognition
    The abstract positions visual prompting as the starting point and proposes modifications to it.

pith-pipeline@v0.9.0 · 5543 in / 1218 out tokens · 46303 ms · 2026-05-13T20:17:31.340006+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

  1. [1]

    Vision datasets: A bench- mark for vision-based industrial inspection.arXiv preprint arXiv:2306.07890, 2023

    Haoping Bai, Shancong Mou, Tatiana Likhomanenko, Ra- mazan Gokberk Cinbis, Oncel Tuzel, Ping Huang, Jiulong Shan, Jianjun Shi, and Meng Cao. Vision datasets: A bench- mark for vision-based industrial inspection.arXiv preprint arXiv:2306.07890, 2023. 6, 9, 10

  2. [2]

    Efficien- tad: Accurate visual anomaly detection at millisecond-level latencies

    Kilian Batzner, Lars Heckler, and Rebecca K ¨onig. Efficien- tad: Accurate visual anomaly detection at millisecond-level latencies. InProceedings of the IEEE/CVF winter confer- ence on applications of computer vision, pages 128–138,

  3. [3]

    Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection

    Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9592–9600, 2019. 6, 9, 10

  4. [4]

    Masked-attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022. 8

  5. [5]

    Yolo-world: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xing- gang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16901–16911, 2024. 3, 5, 6

  6. [6]

    Padim: a patch distribution modeling framework for anomaly detection and localization

    Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: a patch distribution modeling framework for anomaly detection and localization. InInter- national conference on pattern recognition, pages 475–489. Springer, 2021. 1

  7. [7]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 4690–4699, 2019. 4

  8. [8]

    Lerenet: Eliminating intra-class differences for metal surface defect few-shot semantic segmentation.arXiv preprint arXiv:2403.11122, 2024

    Hanze Ding, Zhangkai Wu, Jiyan Zhang, Ming Ping, and Yanfang Liu. Lerenet: Eliminating intra-class differences for metal surface defect few-shot semantic segmentation.arXiv preprint arXiv:2403.11122, 2024. 1, 3

  9. [9]

    Text-guided visual prompt dino for generic segmentation

    Yuchen Guan, Chong Sun, Canmiao Fu, Zhipeng Huang, Chun Yuan, and Chen Li. Text-guided visual prompt dino for generic segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21288– 21298, 2025. 3

  10. [10]

    Category rela- tionship enhancement transformer for industrial defect seg- mentation.Knowledge-Based Systems, 326:114059, 2025

    Zican Hu, Jiaxiang Luo, and Zixiang Hong. Category rela- tionship enhancement transformer for industrial defect seg- mentation.Knowledge-Based Systems, 326:114059, 2025. 1, 3

  11. [11]

    Surface defect saliency of magnetic tile.The Visual Computer, 36(1):85–96,

    Yibin Huang, Congying Qiu, and Kui Yuan. Surface defect saliency of magnetic tile.The Visual Computer, 36(1):85–96,

  12. [12]

    T-rex2: Towards generic object detection via text-visual prompt synergy

    Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-rex2: Towards generic object detection via text-visual prompt synergy. InEuropean Conference on Computer Vision, pages 38–57. Springer, 2024. 1, 3, 4, 5, 6, 8

  13. [13]

    Cin- former: Transformer network with multi-stage cnn feature injection for surface defect segmentation.arXiv preprint arXiv:2309.12639, 2023

    Xiaoheng Jiang, Kaiyi Guo, Yang Lu, Feng Yan, Hao Liu, Jiale Cao, Mingliang Xu, and Dacheng Tao. Cin- former: Transformer network with multi-stage cnn feature injection for surface defect segmentation.arXiv preprint arXiv:2309.12639, 2023. 2

  14. [14]

    Ultralytics yolo11, 2024

    Glenn Jocher and Jing Qiu. Ultralytics yolo11, 2024. 12

  15. [15]

    Few-shot object detection via feature reweighting

    Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. InProceedings of the IEEE/CVF international conference on computer vision, pages 8420–8429, 2019. 10

  16. [16]

    YOLOv11: An Overview of the Key Architectural Enhancements

    Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements.arXiv preprint arXiv:2410.17725, 2024. 8

  17. [17]

    Mask dino: Towards a unified transformer-based framework for object detection and segmentation

    Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 3041–3050, 2023. 8, 12

  18. [18]

    Visual in-context prompting

    Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Jianwei Yang, Chunyuan Li, et al. Visual in-context prompting. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12861–12871, 2024. 1, 3, 4, 5, 6, 8

  19. [19]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuro- pean conference on computer vision, pages 38–55. Springer,

  20. [20]

    An adaptive image segmen- tation network for surface defect detection.IEEE Trans- actions on Neural Networks and Learning Systems, 35(6): 8510–8523, 2022

    Taiheng Liu, Zhaoshui He, Zhijie Lin, Guang-Zhong Cao, Wenqing Su, and Shengli Xie. An adaptive image segmen- tation network for surface defect detection.IEEE Trans- actions on Neural Networks and Learning Systems, 35(6): 8510–8523, 2022. 1, 2

  21. [21]

    Exploring few-shot defect segmentation in general industrial scenarios with metric learning and vision foundation models.arXiv preprint arXiv:2502.01216, 2025

    Tongkun Liu, Bing Li, Xiao Jin, Yupeng Shi, Qiuying Li, and Xiang Wei. Exploring few-shot defect segmentation in general industrial scenarios with metric learning and vision foundation models.arXiv preprint arXiv:2502.01216, 2025. 1, 3

  22. [22]

    A simple image segmentation framework via in-context examples.Advances in Neural Information Processing Systems, 37:25095–25119,

    Yang Liu, Chenchen Jing, Hengtao Li, Muzhi Zhu, Hao Chen, Xinlong Wang, and Chunhua Shen. A simple image segmentation framework via in-context examples.Advances in Neural Information Processing Systems, 37:25095–25119,

  23. [23]

    Unified open-world segmentation with multi-modal prompts

    Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen, Yuling Xi, Bo Feng, Hao Wang, Shiyu Li, and Chunhua Shen. Unified open-world segmentation with multi-modal prompts. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21557–21567, 2025. 3, 4

  24. [24]

    Deep metallic surface defect detection: The new bench- mark and detection network.Sensors, 20(6):1562, 2020

    Xiaoming Lv, Fajie Duan, Jia-jia Jiang, Xiao Fu, and Lin Gan. Deep metallic surface defect detection: The new bench- mark and detection network.Sensors, 20(6):1562, 2020. 6, 9, 10

  25. [25]

    U- net: Convolutional networks for biomedical image segmen- tation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InInternational Conference on Medical image com- puting and computer-assisted intervention, pages 234–241. Springer, 2015. 1, 2

  26. [26]

    Towards total recall in industrial anomaly detection

    Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Sch¨olkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14318–14328, 2022. 1

  27. [27]

    Yoloe: Real-time seeing anything.arXiv preprint arXiv:2503.07465, 2025

    Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yoloe: Real-time seeing anything.arXiv preprint arXiv:2503.07465, 2025. 1, 3, 5, 6

  28. [28]

    Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detec- tion

    Chengjie Wang, Wenbing Zhu, Bin-Bin Gao, Zhenye Gan, Jiangning Zhang, Zhihao Gu, Shuguang Qian, Mingang Chen, and Lizhuang Ma. Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detec- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 22883–22892,

  29. [29]

    Ssd-faster net: A hybrid network for industrial defect inspection.arXiv preprint arXiv:2207.00589, 2022

    Jingyao Wang and Naigong Yu. Ssd-faster net: A hybrid network for industrial defect inspection.arXiv preprint arXiv:2207.00589, 2022. 2

  30. [30]

    Defect transformer: An efficient hybrid trans- former architecture for surface defect detection.Measure- ment, 211:112614, 2023

    Junpu Wang, Guili Xu, Fuju Yan, Jinjin Wang, and Zheng- sheng Wang. Defect transformer: An efficient hybrid trans- former architecture for surface defect detection.Measure- ment, 211:112614, 2023. 1, 2

  31. [31]

    Frustratingly simple few-shot object detection.arXiv preprint arXiv:2003.06957, 2020

    Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gon- zalez, and Fisher Yu. Frustratingly simple few-shot object detection.arXiv preprint arXiv:2003.06957, 2020. 10

  32. [32]

    Seggpt: Towards seg- menting everything in context

    Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. Seggpt: Towards seg- menting everything in context. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1130–1140, 2023. 1, 5, 6

  33. [33]

    Wavelet and prototype augmented query- based transformer for pixel-level surface defect detection

    Feng Yan, Xiaoheng Jiang, Yang Lu, Jiale Cao, Dong Chen, and Mingliang Xu. Wavelet and prototype augmented query- based transformer for pixel-level surface defect detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23860–23869, 2025. 1, 2

  34. [34]

    Meta r-cnn: Towards general solver for instance-level low-shot learning

    Xiaopeng Yan, Ziliang Chen, Anni Xu, Xiaoxi Wang, Xi- aodan Liang, and Liang Lin. Meta r-cnn: Towards general solver for instance-level low-shot learning. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 9577–9586, 2019. 10

  35. [35]

    3cad: A large-scale real- world 3c product dataset for unsupervised anomaly detec- tion

    Enquan Yang, Peng Xing, Hanyang Sun, Wenbo Guo, Yuan- wei Ma, Zechao Li, and Dan Zeng. 3cad: A large-scale real- world 3c product dataset for unsupervised anomaly detec- tion. InProceedings of the AAAI Conference on Artificial Intelligence, pages 9175–9183, 2025. 6, 9, 10

  36. [36]

    Dual wavelet attention networks for image classifi- cation.IEEE Transactions on Circuits and Systems for Video Technology, 33(4):1899–1910, 2022

    Yuting Yang, Licheng Jiao, Xu Liu, Fang Liu, Shuyuan Yang, Lingling Li, Puhua Chen, Xiufang Li, and Zhongjian Huang. Dual wavelet attention networks for image classifi- cation.IEEE Transactions on Circuits and Systems for Video Technology, 33(4):1899–1910, 2022. 2

  37. [37]

    De- trs beat yolos on real-time object detection

    Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. De- trs beat yolos on real-time object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16965–16974, 2024. 8

  38. [38]

    Segment everything everywhere all at once.Advances in neural information processing systems, 36:19769–19782,

    Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once.Advances in neural information processing systems, 36:19769–19782,

  39. [39]

    Spot-the-difference self-supervised pre- training for anomaly detection and segmentation.arXiv preprint arXiv:2207.14315, 2022

    Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pre- training for anomaly detection and segmentation.arXiv preprint arXiv:2207.14315, 2022. 6, 9, 10