pith · machine review for the scientific record

arxiv: 2603.13779 · v2 · submitted 2026-03-14 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:50 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: industrial anomaly detection · vision-language model · multimodal LLM · defect localization · in-context comparison · MMAD benchmark

The pith

AD-Copilot detects industrial anomalies by comparing image pairs visually rather than through language alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AD-Copilot as a vision-language model built for industrial anomaly detection. General multimodal models struggle here because they are trained on web data and compare images only in language space, missing subtle visual differences. AD-Copilot adds a Comparison Encoder that applies cross-attention directly to paired image features and trains it in stages on a new dataset called Chat-AD, which is curated from sparsely labeled industrial images for captioning, visual question answering, and defect localization. This produces 82.3 percent accuracy on the MMAD benchmark, up to 3.35 times better localization on the new MMAD-BBox test, and performance that exceeds human experts on several tasks while generalizing to other benchmarks.

Core claim

AD-Copilot achieves 82.3 percent accuracy on the MMAD benchmark and a maximum 3.35 times improvement on MMAD-BBox. The gains come from a Comparison Encoder that applies cross-attention between paired image features to enable fine-grained visual comparison, trained through a multi-stage process on the Chat-AD dataset, which is mined from sparsely labeled industrial images for captioning, VQA, and defect localization.

What carries the argument

Comparison Encoder that applies cross-attention between features of paired images to support multi-image fine-grained perception during anomaly detection.
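
A minimal sketch of what such a module could look like, assuming standard multi-head cross-attention over patch tokens. The class name, dimensions, and residual fusion below are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of a cross-attention comparison module: query-image
# tokens attend to normal-template tokens. Not the paper's actual
# Comparison Encoder; shapes and fusion are assumptions.
import torch
import torch.nn as nn

class ComparisonEncoderSketch(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, template_feats: torch.Tensor):
        # query_feats:    (B, N, D) patch tokens of the inspected image
        # template_feats: (B, M, D) patch tokens of the normal reference
        attended, _ = self.cross_attn(query_feats, template_feats, template_feats)
        # Residual connection keeps the original query content alongside
        # the comparison signal before the tokens reach the language model.
        return self.norm(query_feats + attended)

# Usage: the resulting comparison tokens would serve as extra visual
# context for the LLM, alongside the base image tokens.
enc = ComparisonEncoderSketch()
q, t = torch.randn(2, 256, 1024), torch.randn(2, 256, 1024)
out = enc(q, t)  # (2, 256, 1024)
```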

Load-bearing premise

The data curation pipeline extracts accurate inspection knowledge from sparsely labeled industrial images and produces high-quality training samples without introducing major label noise or domain mismatch.

What would settle it

A controlled experiment that replaces the cross-attention Comparison Encoder with independent image encoding and measures whether accuracy on subtle defect pairs falls back to baseline levels would test whether the visual comparison mechanism drives the reported gains.
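
A toy illustration of why that ablation is informative: when the defect signal lives in the query-template difference rather than in either image alone, an independent readout separates poorly while a comparison readout separates cleanly. Everything below (data, scores, metric) is synthetic and hypothetical:

```python
# Toy illustration, not the paper's experiment: defects are simulated as
# small perturbations of a per-pair normal template, so the discriminative
# signal lives in the query-template difference, not in either image alone.
import torch

torch.manual_seed(0)
B, D = 512, 64
template = torch.randn(B, D)                  # per-pair normal appearance
labels = torch.zeros(B)
labels[: B // 2] = 1.0                        # first half carries a defect
noise = 0.05 * torch.randn(B, D)
query = template + noise + labels[:, None] * 0.2 * torch.randn(B, D)

# Independent readout: score each query on its own (feature norm).
score_indep = query.norm(dim=1)
# Comparison readout: score the query-template residual.
score_comp = (query - template).norm(dim=1)

def rank_accuracy(scores, labels):
    # Fraction of (defect, normal) pairs ranked correctly (an AUROC proxy).
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).float().mean().item()

print("independent:", rank_accuracy(score_indep, labels))  # weak, near 0.5
print("comparison: ", rank_accuracy(score_comp, labels))   # near 1.0
```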

Figures

Figures reproduced from arXiv: 2603.13779 by Bin-Bin Gao, Chengjie Wang, Feng Zheng, Hanqiu Deng, Heng Zhao, Jian Li, Jun Liu, Xi Jiang, Yong Liu, Yue Guo.

Figure 1. AD-Copilot was trained on meticulously curated datasets, including high-quality synthetic IAD caption and VQA data, together with general-domain …

Figure 2. The pipeline for Chat-AD synthesis. Each query image is paired with a normal template image, and the generated text explicitly describes visual …

Figure 3. The model architecture of AD-Copilot. The Comparison Encoder performs cross-attention between paired image features to generate comparison …

Figure 4. Task and metric of MMAD-BBox.

Figure 5. Data scaling behavior of AD-Copilot. Performance improves with …

Figure 6. Visualization of the attention ratio between base tokens and comparison …

Figure 7. Qualitative evaluation of AD-Copilot compared with other models across various IAD tasks.
Original abstract

Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs trained mostly on general web data differ significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that employs cross-attention between paired image features to enhance multi-image fine-grained perception, and is trained with a multi-stage strategy that incorporates domain knowledge and gradually enhances IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. The experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. In the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline. AD-Copilot also exhibits excellent generalization of its performance gains across other specialized and general-purpose benchmarks. Remarkably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AD-Copilot, a specialized multimodal large language model for industrial anomaly detection (IAD) that performs visual in-context comparison. It proposes a novel data curation pipeline to extract inspection knowledge from sparsely labeled industrial images and generate the Chat-AD dataset for captioning, VQA, and defect localization tasks. The architecture includes a Comparison Encoder using cross-attention between paired image features, trained via a multi-stage strategy incorporating domain knowledge. A new MMAD-BBox benchmark for bounding-box anomaly localization is introduced. Reported results include 82.3% accuracy on MMAD (outperforming baselines without data leakage), up to 3.35× improvement on MMAD-BBox, generalization across other benchmarks, and surpassing human expert performance on several IAD tasks, with plans to release datasets and models.
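
For context on what bounding-box-based evaluation typically means here: the paper's exact MMAD-BBox protocol is not reproduced on this page, so the IoU threshold and the one-to-one matching rule below are assumptions, not the benchmark's definition:

```python
# Hedged sketch of IoU-based box scoring; MMAD-BBox's actual protocol
# may differ. Boxes are (x1, y1, x2, y2) in pixels.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def localization_accuracy(preds, gts, thresh=0.5):
    # A predicted box counts as a hit if it overlaps the ground-truth
    # defect box above the (assumed) IoU threshold.
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / max(len(gts), 1)

print(localization_accuracy([(10, 10, 50, 50)], [(12, 8, 48, 52)]))  # 1.0
```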

Significance. If the empirical claims hold after verification of data quality and no leakage, the work would advance the application of MLLMs to specialized industrial domains by addressing the domain gap and the limitation of language-only comparison for subtle visual differences. The release of Chat-AD and MMAD-BBox would provide reusable resources for the community. The multi-stage training and visual cross-attention mechanism represent a targeted adaptation that could inform future domain-specific MLLM development.

major comments (2)
  1. [Data Curation Pipeline] The headline results (82.3% MMAD accuracy, 3.35× MMAD-BBox gain, human-expert surpassing) depend entirely on the data curation pipeline producing high-quality samples without significant label noise or domain mismatch. No quantitative validation (noise rate, inter-annotator agreement, or domain-shift metrics) is reported for the mined Chat-AD data, leaving the foundation of all gains unverified.
  2. [Experiments] The claim of 'without any data leakage' for the 82.3% MMAD accuracy and generalization results is load-bearing but unsupported by details on train/test split construction or leakage verification for the mined industrial images.
minor comments (2)
  1. [Abstract] The abstract asserts 'excellent generalization' across benchmarks but omits specific accuracy numbers or tables summarizing those results, reducing clarity for readers.
  2. [Method] Notation for the Comparison Encoder's cross-attention (e.g., how paired features are fused before the language model) could be formalized with an equation to aid reproducibility.
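
One plausible formalization of the requested equation, using standard cross-attention notation rather than the authors' verified fusion: with query-image patch tokens $Z_q \in \mathbb{R}^{N \times d}$ and normal-template tokens $Z_t \in \mathbb{R}^{M \times d}$,

$$\mathrm{CompEnc}(Z_q, Z_t) = Z_q + \mathrm{softmax}\!\left(\frac{(Z_q W_Q)(Z_t W_K)^{\top}}{\sqrt{d}}\right) Z_t W_V,$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learned projections and the output tokens are passed to the language model alongside the base visual tokens. Whether the paper adds multi-head structure, gating, or a different residual path is exactly what the requested equation should pin down.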

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and will revise the manuscript to provide the requested clarifications and additional validation details.

Point-by-point responses
  1. Referee: [Data Curation Pipeline] The headline results (82.3% MMAD accuracy, 3.35× MMAD-BBox gain, human-expert surpassing) depend entirely on the data curation pipeline producing high-quality samples without significant label noise or domain mismatch. No quantitative validation (noise rate, inter-annotator agreement, or domain-shift metrics) is reported for the mined Chat-AD data, leaving the foundation of all gains unverified.

    Authors: We agree that explicit quantitative validation strengthens the claims. The manuscript describes the curation pipeline but does not report noise rates, inter-annotator agreement, or domain-shift metrics. In the revision we will add these: (i) noise rate estimates from manual inspection of 500 randomly sampled instances, (ii) inter-annotator agreement on a subset of VQA and localization annotations, and (iii) domain-shift metrics (e.g., Fréchet distance on CLIP features) between Chat-AD and the MMAD test distribution. revision: yes

  2. Referee: [Experiments] The claim of 'without any data leakage' for the 82.3% MMAD accuracy and generalization results is load-bearing but unsupported by details on train/test split construction or leakage verification for the mined industrial images.

    Authors: We acknowledge that the current text states the no-leakage claim without sufficient procedural detail. The splits were constructed by source-level partitioning and hash-based deduplication to ensure no image or caption overlap. In the revision we will add a dedicated paragraph (and appendix table) that specifies the exact partitioning rules, deduplication method, and verification steps performed to confirm zero leakage between Chat-AD training data and all reported test sets. revision: yes
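
A minimal sketch of the kind of hash-based leakage check the response describes; the paths are hypothetical, and the authors' pipeline may also apply source-level partitioning rules and perceptual hashes beyond this exact-duplicate test:

```python
# Hedged sketch of hash-based leakage verification between training
# images and benchmark test images. Exact SHA-256 file hashes catch
# byte-identical duplicates only.
import hashlib
from pathlib import Path

def file_hashes(root: str) -> set[str]:
    return {
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(root).rglob("*.png")
    }

def leaked_images(train_dir: str, test_dir: str) -> set[str]:
    # A non-empty intersection flags images shared between the training
    # pool and a benchmark test set.
    return file_hashes(train_dir) & file_hashes(test_dir)

# Usage with hypothetical paths:
# assert not leaked_images("chat_ad/images", "mmad/test_images")
```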

Circularity Check

0 steps flagged

No circularity: results rest on external empirical benchmarks

Full rationale

The paper presents an empirical system: a data-curation pipeline that produces the Chat-AD dataset, a Comparison Encoder using cross-attention, and multi-stage training. All headline numbers (82.3% MMAD accuracy, 3.35× MMAD-BBox gain, human-expert surpassing) are obtained by evaluating the trained model on held-out external benchmarks (MMAD, MMAD-BBox, and others) with an explicit "no data leakage" claim. No equation, parameter, or prediction is defined in terms of itself; the derivation chain consists of standard supervised training followed by independent test-set measurement. The central claims therefore do not reduce to tautology, nor do they rest on load-bearing self-citation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard machine learning assumptions about transfer learning from general MLLMs and the utility of cross-attention for multi-image comparison; no new physical entities or ad-hoc constants are introduced beyond typical training hyperparameters.

free parameters (1)
  • Multi-stage training hyperparameters
    The multi-stage training strategy incorporates domain knowledge and likely requires tuned learning rates, loss weights, and stage durations fitted during development; a purely illustrative schedule is sketched after this ledger.
axioms (1)
  • Domain assumption: Base MLLMs can acquire fine-grained industrial anomaly detection skills through supervised fine-tuning on curated multimodal data
    Invoked when stating that the model is trained with a multi-stage strategy to enhance IAD skills.
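
One hypothetical shape for the stage schedule named in the free-parameter entry above; every name and value below is invented to make the ledger concrete, and none comes from the paper:

```python
# Hypothetical multi-stage schedule; all names and values are invented
# for illustration. Nothing here is taken from the paper.
STAGES = [
    {"name": "domain_captioning", "data": "chat_ad_captions",
     "lr": 1e-5, "epochs": 1, "trainable": ["projector"]},
    {"name": "iad_vqa", "data": "chat_ad_vqa",
     "lr": 5e-6, "epochs": 2, "trainable": ["projector", "comparison_encoder"]},
    {"name": "defect_localization", "data": "chat_ad_bbox",
     "lr": 2e-6, "epochs": 1,
     "trainable": ["projector", "comparison_encoder", "llm_adapter"]},
]
```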

pith-pipeline@v0.9.0 · 5643 in / 1407 out tokens · 52037 ms · 2026-05-15T11:50:39.508880+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    IAD-Unify unifies industrial anomaly segmentation, region-grounded language understanding, and mask-guided generation in one framework using DINOv2 token injection into Qwen3.5, supported by the new Anomaly-56K datase...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    On domain-adaptive post-training for multimodal large language models

    D. Cheng, S. Huang, Z. Zhu, X. Zhang, W. X. Zhao, Z. Luan, B. Dai, and Z. Zhang, "On domain-adaptive post-training for multimodal large language models," arXiv preprint arXiv:2411.19930, 2024.

  2. [2]

    Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

    W. Xu, H. P. Chan, L. Li, M. Aljunied, R. Yuan, J. Wang, C. Xiao, G. Chen, C. Liu, Z. Li et al., "Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning," arXiv preprint arXiv:2506.07044, 2025.

  3. [3]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.

  4. [4]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao et al., "InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency," arXiv preprint arXiv:2508.18265, 2025.

  5. [5]

    Seed1.5-VL Technical Report

    D. Guo, F. Wu, F. Zhu, F. Leng, G. Shi, H. Chen, H. Fan, J. Wang, J. Jiang, J. Wang et al., "Seed1.5-VL technical report," arXiv preprint arXiv:2505.07062, 2025.

  6. [6]

    A survey of visual sensory anomaly detection

    X. Jiang, G. Xie, J. Wang, Y. Liu, C. Wang, F. Zheng, and Y. Jin, "A survey of visual sensory anomaly detection," arXiv preprint arXiv:2202.07006, 2022.

  7. [7]

    SoftPatch+: Fully unsupervised anomaly classification and segmentation

    C. Wang, X. Jiang, B.-B. Gao, Z. Gan, Y. Liu, F. Zheng, and L. Ma, "SoftPatch+: Fully unsupervised anomaly classification and segmentation," Pattern Recognition, vol. 161, p. 111295, 2025.

  8. [8]

    MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection

    X. Jiang, J. Li, H. Deng, Y. Liu, B.-B. Gao, Y. Zhou, J. Li, C. Wang, and F. Zheng, "MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection," in The Thirteenth International Conference on Learning Representations, 2025.

  9. [9]

    Efficient multimodal large language models: A survey

    Y. Jin, J. Li, Y. Liu, T. Gu, K. Wu, Z. Jiang, M. He, B. Zhao, X. Tan, Z. Gan et al., "Efficient multimodal large language models: A survey," arXiv preprint arXiv:2405.10739, 2024.

  10. [10]

    ADer: A comprehensive benchmark for multi-class visual anomaly detection

    J. Zhang, H. He, Z. Gan, Q. He, Y. Cai, Z. Xue, Y. Wang, C. Wang, L. Xie, and Y. Liu, "ADer: A comprehensive benchmark for multi-class visual anomaly detection," arXiv preprint arXiv:2406.03262, vol. 2, no. 3, 2024.

  11. [11]

    Can multimodal large language models be guided to improve industrial anomaly detection?

    Z. Chen, H. Chen, M. Imani, and F. Imani, "Can multimodal large language models be guided to improve industrial anomaly detection?" arXiv preprint arXiv:2501.15795, 2025.

  12. [12]

    PromptAD: Learning prompts with only normal samples for few-shot anomaly detection

    X. Li, Z. Zhang, X. Tan, C. Chen, Y. Qu, Y. Xie, and L. Ma, "PromptAD: Learning prompts with only normal samples for few-shot anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 16838–16848.

  13. [13]

    MANTA: A large-scale multi-view and visual-text anomaly detection dataset for tiny objects

    L. Fan, D. Fan, Z. Hu, Y. Ding, D. Di, K. Yi, M. Pagnucco, and Y. Song, "MANTA: A large-scale multi-view and visual-text anomaly detection dataset for tiny objects," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 25518–25527.

  14. [14]

    Towards zero-shot anomaly detection and reasoning with multimodal large language models

    J. Xu, S.-Y. Lo, B. Safaei, V. M. Patel, and I. Dwivedi, "Towards zero-shot anomaly detection and reasoning with multimodal large language models," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 20370–20382.

  15. [15]

    Triad: Empowering LMM-based anomaly detection with expert-guided region-of-interest tokenizer and manufacturing process

    Y. Li, S. Yuan, H. Wang, Q. Li, M. Liu, C. Xu, G. Shi, and W. Zuo, "Triad: Empowering LMM-based anomaly detection with expert-guided region-of-interest tokenizer and manufacturing process," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 21917–21926.

  16. [16]

    AnomalyGPT: Detecting industrial anomalies using large vision-language models

    Z. Gu, B. Zhu, G. Zhu, Y. Chen, M. Tang, and J. Wang, "AnomalyGPT: Detecting industrial anomalies using large vision-language models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 1932–1940.

  17. [17]

    EIAD: Explainable industrial anomaly detection via multi-modal large language models

    Z. Zhang, J. Ruan, X. Gao, T. Liu, and Y. Fu, "EIAD: Explainable industrial anomaly detection via multi-modal large language models," arXiv preprint arXiv:2503.14162, 2025.

  18. [18]

    MVTec AD: a comprehensive real-world dataset for unsupervised anomaly detection

    P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, "MVTec AD: a comprehensive real-world dataset for unsupervised anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9592–9600.

  19. [19]

    Spot-the-difference self-supervised pre-training for anomaly detection and segmentation

    Y. Zou, J. Jeong, L. Pemula, D. Zhang, and O. Dabeer, "Spot-the-difference self-supervised pre-training for anomaly detection and segmentation," in European Conference on Computer Vision. Springer, 2022, pp. 392–408.

  20. [20]

    OpenAI o1 System Card

    OpenAI, A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney et al., "OpenAI o1 system card," arXiv preprint arXiv:2412.16720, 2024.

  21. [21]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning," Nature, vol. 645, pp. 633–638, 2025. [Online]. Available: https://doi.org/10.1038/s41586-025-09422-z

  22. [22]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models," arXiv preprint arXiv:2402.03300, 2024.

  23. [23]

    AnomalyR1: A GRPO-based end-to-end MLLM for industrial anomaly detection

    Y. Chao, J. Liu, J. Tang, and G. Wu, "AnomalyR1: A GRPO-based end-to-end MLLM for industrial anomaly detection," arXiv preprint arXiv:2504.11914, 2025.

  24. [24]

    OmniAD: Detect and understand industrial anomaly via multimodal reasoning

    S. Zhao, Y. Lin, L. Han, Y. Zhao, and Y. Wei, "OmniAD: Detect and understand industrial anomaly via multimodal reasoning," arXiv preprint arXiv:2505.22039, 2025.

  25. [25]

    EMIT: Enhancing MLLMs for industrial anomaly detection via difficulty-aware GRPO

    W. Guan, J. Lan, J. Cao, H. Tan, H. Zhu, and W. Wang, "EMIT: Enhancing MLLMs for industrial anomaly detection via difficulty-aware GRPO," arXiv preprint arXiv:2507.21619, 2025.

  26. [26]

    AD-FM: Multimodal LLMs for anomaly detection via multi-stage reasoning and fine-grained reward optimization

    J. Liao, Y. Su, R.-C. Tu, Z. Jin, W. Sun, Y. Li, D. Tao, X. Xu, and X. Yang, "AD-FM: Multimodal LLMs for anomaly detection via multi-stage reasoning and fine-grained reward optimization," arXiv preprint arXiv:2508.04175, 2025.

  27. [27]

    End-to-end object detection with transformers

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.

  28. [28]

    Accurate industrial anomaly detection and localization using weakly-supervised residual transformers

    H. Li, J. Wu, D. Liu, L. Y. Wu, H. Chen, and C. Shen, "Accurate industrial anomaly detection and localization using weakly-supervised residual transformers," IEEE Transactions on Image Processing, 2026.

  29. [29]

    UniPCB: A unified vision-language benchmark for open-ended PCB quality inspection

    F. Sun, X. Jiang, J. Wu et al., "UniPCB: A unified vision-language benchmark for open-ended PCB quality inspection," arXiv preprint arXiv:2601.19222, 2026.

  30. [30]

    OmniDiff: A comprehensive benchmark for fine-grained image difference captioning

    Y. Liu, S. Hou, S. Hou, J. Du, S. Meng, and Y. Huang, "OmniDiff: A comprehensive benchmark for fine-grained image difference captioning," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 21440–21449.

  31. [31]

    SoftPatch: Unsupervised anomaly detection with noisy data

    X. Jiang, J. Liu, J. Wang, Q. Nie, K. Wu, Y. Liu, C. Wang, and F. Zheng, "SoftPatch: Unsupervised anomaly detection with noisy data," Advances in Neural Information Processing Systems, vol. 35, pp. 15433–15445, 2022.

  32. [32]

    Anomaly detection via reverse distillation from one-class embedding

    H. Deng and X. Li, "Anomaly detection via reverse distillation from one-class embedding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9737–9746.

  33. [33]

    EfficientAD: Accurate visual anomaly detection at millisecond-level latencies

    K. Batzner, L. Heckler, and R. König, "EfficientAD: Accurate visual anomaly detection at millisecond-level latencies," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 128–138.

  34. [34]

    FocusPatch AD: Few-shot multi-class anomaly detection with unified keywords patch prompts

    X. Ding, X. Li, M. Chen, J. Gong, and Y. Xie, "FocusPatch AD: Few-shot multi-class anomaly detection with unified keywords patch prompts," IEEE Transactions on Image Processing, vol. 35, pp. 112–123, 2025.

  35. [35]

    A unified model for multi-class anomaly detection

    Z. You, L. Cui, Y. Shen, K. Yang, X. Lu, Y. Zheng, and X. Le, "A unified model for multi-class anomaly detection," Advances in Neural Information Processing Systems, vol. 35, pp. 4571–4584, 2022.

  36. [36]

    Toward multi-class anomaly detection: Exploring class-aware unified model against inter-class interference

    X. Jiang, Y. Chen, Q. Nie, J. Liu, Y. Liu, C. Wang, and F. Zheng, "Toward multi-class anomaly detection: Exploring class-aware unified model against inter-class interference," arXiv preprint arXiv:2403.14213, 2024.

  37. [37]

    Normal-abnormal guided generalist anomaly detection

    Y. Wang, X. Wang, Y. Gong, and J. Xiao, "Normal-abnormal guided generalist anomaly detection," arXiv preprint arXiv:2510.00495, 2025.

  38. [38]

    SEAS: Few-shot industrial anomaly image generation with separation and sharing fine-tuning

    Z. Dai, S. Zeng, H. Liu, X. Li, F. Xue, and Y. Zhou, "SEAS: Few-shot industrial anomaly image generation with separation and sharing fine-tuning," arXiv preprint arXiv:2410.14987, 2024.

  39. [39]

    MuSc: Zero-shot industrial anomaly classification and segmentation with mutual scoring of the unlabeled images

    X. Li, Z. Huang, F. Xue, and Y. Zhou, "MuSc: Zero-shot industrial anomaly classification and segmentation with mutual scoring of the unlabeled images," in The Twelfth International Conference on Learning Representations, 2024.

  40. [40]

    WinCLIP: Zero-/few-shot anomaly classification and segmentation

    J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, "WinCLIP: Zero-/few-shot anomaly classification and segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19606–19616.

  41. [41]

    AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection

    Y. Cao, J. Zhang, L. Frittoli, Y. Cheng, W. Shen, and G. Boracchi, "AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection," in European Conference on Computer Vision. Springer, 2024, pp. 55–72.

  42. [42]

    AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection

    Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, "AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection," arXiv preprint arXiv:2310.18961, 2023.

  43. [43]

    MultiADS: Defect-aware supervision for multi-type anomaly detection and segmentation in zero-shot learning

    Y. Sadikaj, H. Zhou, L. Halilaj, S. Schmid, S. Staab, and C. Plant, "MultiADS: Defect-aware supervision for multi-type anomaly detection and segmentation in zero-shot learning," arXiv preprint arXiv:2504.06740, 2025.

  44. [44]

    Towards generic anomaly detection and understanding: Large-scale visual-linguistic model (GPT-4V) takes the lead

    Y. Cao, X. Xu, C. Sun, X. Huang, and W. Shen, "Towards generic anomaly detection and understanding: Large-scale visual-linguistic model (GPT-4V) takes the lead," arXiv preprint arXiv:2311.02782, 2023.

  45. [45]

    Customizing visual-language foundation models for multi-modal anomaly detection and reasoning

    X. Xu, Y. Cao, H. Zhang, N. Sang, and X. Huang, "Customizing visual-language foundation models for multi-modal anomaly detection and reasoning," in 2025 28th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2025, pp. 1443–1448.

  46. [46]

    Do LLMs understand visual anomalies? Uncovering LLM's capabilities in zero-shot anomaly detection

    J. Zhu, S. Cai, F. Deng, B. C. Ooi, and J. Wu, "Do LLMs understand visual anomalies? Uncovering LLM's capabilities in zero-shot anomaly detection," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 48–57.

  47. [47]

    GPT-4V-AD: Exploring grounding potential of VQA-oriented GPT-4V for zero-shot anomaly detection

    J. Zhang, H. He, X. Chen, Z. Xue, Y. Wang, C. Wang, L. Xie, and Y. Liu, "GPT-4V-AD: Exploring grounding potential of VQA-oriented GPT-4V for zero-shot anomaly detection," in International Joint Conference on Artificial Intelligence. Springer, 2024, pp. 3–16.

  48. [48]

    Detect, classify, act: Categorizing industrial anomalies with multi-modal large language models

    S. Mokhtar, A. Mousakhan, S. Galesso, J. Tayyub, and T. Brox, "Detect, classify, act: Categorizing industrial anomalies with multi-modal large language models," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 4058–4067.

  49. [49]

    IAD-R1: Reinforcing consistent reasoning in industrial anomaly detection

    Y. Li, Y. Cao, C. Liu, Y. Xiong, X. Dong, and C. Huang, "IAD-R1: Reinforcing consistent reasoning in industrial anomaly detection," arXiv preprint arXiv:2508.09178, 2025.

  50. [50]

    LR-IAD: Mask-free industrial anomaly detection with logical reasoning

    P. Zeng, F. Pang, Z. Wang, and A. Yang, "LR-IAD: Mask-free industrial anomaly detection with logical reasoning," arXiv preprint arXiv:2504.19524, 2025.

  51. [51]

    A transformer-based Siamese network for change detection

    W. G. C. Bandara and V. M. Patel, "A transformer-based Siamese network for change detection," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2022, pp. 207–210.

  52. [52]

    Robust change captioning

    D. H. Park, T. Darrell, and A. Rohrbach, "Robust change captioning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4624–4633.

  53. [53]

    Learning to describe differences between pairs of similar images

    H. Jhamtani and T. Berg-Kirkpatrick, "Learning to describe differences between pairs of similar images," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.

  54. [54]

    Changes to captions: An attentive network for remote sensing change captioning

    S. Chang and P. Ghamisi, "Changes to captions: An attentive network for remote sensing change captioning," IEEE Transactions on Image Processing, vol. 32, pp. 6047–6060, 2023.

  55. [55]

    MagicBrush: A manually annotated dataset for instruction-guided image editing

    K. Zhang, L. Mo, W. Chen, H. Sun, and Y. Su, "MagicBrush: A manually annotated dataset for instruction-guided image editing," in Advances in Neural Information Processing Systems, 2023.

  56. [56]

    OneDiff: A generalist model for image difference captioning

    E. Hu, L. Guo, T. Yue, Z. Zhao, S. Xue, and J. Liu, "OneDiff: A generalist model for image difference captioning," in Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 2439–2455.

  57. [57]

    Img-Diff: Contrastive data synthesis for multimodal large language models

    Q. Jiao, D. Chen, Y. Huang, B. Ding, Y. Li, and Y. Shen, "Img-Diff: Contrastive data synthesis for multimodal large language models," arXiv preprint arXiv:2408.04594, 2024.

  58. [58]

    COFT-AD: Contrastive fine-tuning for few-shot anomaly detection

    J. Liao, X. Xu, M. C. Nguyen, A. Goodge, and C. S. Foo, "COFT-AD: Contrastive fine-tuning for few-shot anomaly detection," IEEE Transactions on Image Processing, vol. 33, pp. 2090–2103, 2024.

  59. [59]

    A multi-category anomaly editing network with correlation exploration and voxel-level attention for unsupervised surface anomaly detection

    R. Zhang and H.-M. Hu, "A multi-category anomaly editing network with correlation exploration and voxel-level attention for unsupervised surface anomaly detection," IEEE Transactions on Image Processing, 2025.

  60. [60]

    AD3: Introducing a score for anomaly detection dataset difficulty assessment using viaduct dataset

    J. Lehr, J. Philipps, A. Sargsyan, M. Pape, and J. Krüger, "AD3: Introducing a score for anomaly detection dataset difficulty assessment using viaduct dataset," in European Conference on Computer Vision. Springer, 2024, pp. 449–464.

  61. [61]

    Real-IAD: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection

    C. Wang, W. Zhu, B.-B. Gao, Z. Gan, J. Zhang, Z. Gu, S. Qian, M. Chen, and L. Ma, "Real-IAD: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22883–22892.

  62. [62]

    3CAD: A large-scale real-world 3C product dataset for unsupervised anomaly detection

    E. Yang, P. Xing, H. Sun, W. Guo, Y. Ma, Z. Li, and D. Zeng, "3CAD: A large-scale real-world 3C product dataset for unsupervised anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9175–9183.

  63. [63]

    Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions

    S. Jezek, M. Jonak, R. Burget, P. Dvorak, and M. Skotak, "Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions," in 2021 13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT). IEEE, 2021, pp. 66–71.

  64. [64]

    Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization

    P. Bergmann, K. Batzner, M. Fauser, D. Sattlegger, and C. Steger, "Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization," International Journal of Computer Vision, vol. 130, no. 4, pp. 947–969, 2022.

  65. [65]

    PKU-GoodsAD: A supermarket goods dataset for unsupervised anomaly detection and segmentation

    J. Zhang, R. Ding, M. Ban, and L. Dai, "PKU-GoodsAD: A supermarket goods dataset for unsupervised anomaly detection and segmentation," IEEE Robotics and Automation Letters, 2024.

  66. [66]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.

  67. [67]

    LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day

    C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, "LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day," Advances in Neural Information Processing Systems, vol. 36, pp. 28541–28564, 2023.

  68. [68]

    2 OLMo 2 Furious

    T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan et al., "2 OLMo 2 Furious," arXiv preprint arXiv:2501.00656, 2024.

  69. [69]

    Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models

    M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S. Park, M. Salehi, N. Muennighoff, K. Lo, L. Soldaini et al., "Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 91–104.

  70. [70]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li, "LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models," arXiv preprint arXiv:2407.07895, 2024.

  71. [71]

    SFT memorizes, RL generalizes: A comparative study of foundation model post-training

    T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma, "SFT memorizes, RL generalizes: A comparative study of foundation model post-training," in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=dYur3yabMj

  72. [72]

    Detect anything via next point prediction

    Q. Jiang, J. Huo, X. Chen, Y. Xiong, Z. Zeng, Y. Chen, T. Ren, J. Yu, and L. Zhang, "Detect anything via next point prediction," arXiv preprint arXiv:2510.12798, 2025.

  73. [73]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma, "LlamaFactory: Unified efficient fine-tuning of 100+ language models," arXiv preprint arXiv:2403.13372, 2024.

  74. [74]

    Perception-R1: Pioneering perception policy with reinforcement learning

    E. Yu, K. Lin, L. Zhao, J. Yin, Y. Wei, Y. Peng, H. Wei, J. Sun, C. Han, Z. Ge, X. Zhang, D. Jiang, J. Wang, and W. Tao, "Perception-R1: Pioneering perception policy with reinforcement learning," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=BeXcXrXetA

  75. [75]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-B. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser et al., "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv preprint arXiv:2403.05530, 2024.

  76. [76]

    Qwen3-VL: Sharper vision, deeper thought, broader action

    Qwen Team, "Qwen3-VL: Sharper vision, deeper thought, broader action," Qwen Blog, 2025, accessed 2025-10-04.

  77. [77]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai, "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks," arXiv preprint arXiv:2312.14238, 2023.