pith · machine review for the scientific record

arxiv: 2603.13779 · v2 · submitted 2026-03-14 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:50 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: industrial anomaly detection · vision-language model · multimodal LLM · defect localization · in-context comparison · MMAD benchmark

The pith

AD-Copilot detects industrial anomalies by comparing image pairs visually rather than through language alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AD-Copilot as a vision-language model built for industrial anomaly detection. General multimodal models struggle here because they are trained on web data and compare images only in language space, missing subtle visual differences. AD-Copilot adds a Comparison Encoder that applies cross-attention directly to paired image features and trains it in stages on a new dataset called Chat-AD, which is curated from sparsely labeled industrial images for captioning, visual question answering, and defect localization. This produces 82.3 percent accuracy on the MMAD benchmark, up to 3.35 times better localization on the new MMAD-BBox test, and performance that exceeds human experts on several tasks while generalizing to other benchmarks.

Core claim

AD-Copilot achieves 82.3 percent accuracy on the MMAD benchmark and a maximum 3.35 times improvement on MMAD-BBox. The gains come from a Comparison Encoder that applies cross-attention between paired image features to enable fine-grained visual comparison, trained through a multi-stage process on the Chat-AD dataset, which is mined from sparsely labeled industrial images for captioning, VQA, and defect localization.

What carries the argument

Comparison Encoder that applies cross-attention between features of paired images to support multi-image fine-grained perception during anomaly detection.
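
A minimal sketch of what such a module could look like, assuming standard multi-head cross-attention over patch tokens. The class name, dimensions, and residual fusion below are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of a cross-attention comparison module: query-image
# tokens attend to normal-template tokens. Not the paper's actual
# Comparison Encoder; shapes and fusion are assumptions.
import torch
import torch.nn as nn

class ComparisonEncoderSketch(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, template_feats: torch.Tensor):
        # query_feats:    (B, N, D) patch tokens of the inspected image
        # template_feats: (B, M, D) patch tokens of the normal reference
        attended, _ = self.cross_attn(query_feats, template_feats, template_feats)
        # Residual connection keeps the original query content alongside
        # the comparison signal before the tokens reach the language model.
        return self.norm(query_feats + attended)

# Usage: the resulting comparison tokens would serve as extra visual
# context for the LLM, alongside the base image tokens.
enc = ComparisonEncoderSketch()
q, t = torch.randn(2, 256, 1024), torch.randn(2, 256, 1024)
out = enc(q, t)  # (2, 256, 1024)
```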

Load-bearing premise

The data curation pipeline extracts accurate inspection knowledge from sparsely labeled industrial images and produces high-quality training samples without introducing major label noise or domain mismatch.

What would settle it

A controlled experiment that replaces the cross-attention Comparison Encoder with independent image encoding and measures whether accuracy on subtle defect pairs falls back to baseline levels would test whether the visual comparison mechanism drives the reported gains.
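
A toy illustration of why that ablation is informative: when the defect signal lives in the query-template difference rather than in either image alone, an independent readout separates poorly while a comparison readout separates cleanly. Everything below (data, scores, metric) is synthetic and hypothetical:

```python
# Toy illustration, not the paper's experiment: defects are simulated as
# small perturbations of a per-pair normal template, so the discriminative
# signal lives in the query-template difference, not in either image alone.
import torch

torch.manual_seed(0)
B, D = 512, 64
template = torch.randn(B, D)                  # per-pair normal appearance
labels = torch.zeros(B)
labels[: B // 2] = 1.0                        # first half carries a defect
noise = 0.05 * torch.randn(B, D)
query = template + noise + labels[:, None] * 0.2 * torch.randn(B, D)

# Independent readout: score each query on its own (feature norm).
score_indep = query.norm(dim=1)
# Comparison readout: score the query-template residual.
score_comp = (query - template).norm(dim=1)

def rank_accuracy(scores, labels):
    # Fraction of (defect, normal) pairs ranked correctly (an AUROC proxy).
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).float().mean().item()

print("independent:", rank_accuracy(score_indep, labels))  # weak, near 0.5
print("comparison: ", rank_accuracy(score_comp, labels))   # near 1.0
```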

Figures

Figures reproduced from arXiv: 2603.13779 by Bin-Bin Gao, Chengjie Wang, Feng Zheng, Hanqiu Deng, Heng Zhao, Jian Li, Jun Liu, Xi Jiang, Yong Liu, Yue Guo.

Figure 1. AD-Copilot was trained on meticulously curated datasets, including high-quality synthetic IAD caption and VQA data, together with general-domain …

Figure 2. The pipeline for Chat-AD synthesis. Each query image is paired with a normal template image, and the generated text explicitly describes visual …

Figure 3. The model architecture of AD-Copilot. The Comparison Encoder performs cross-attention between paired image features to generate comparison …

Figure 4. Task and metric of MMAD-BBox.

Figure 5. Data scaling behavior of AD-Copilot. Performance improves with …

Figure 6. Visualization of the attention ratio between base tokens and comparison …

Figure 7. Qualitative evaluation of AD-Copilot compared with other models across various IAD tasks.
Original abstract

Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs trained mostly on general web data differ significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that employs cross-attention between paired image features to enhance multi-image fine-grained perception, and is trained with a multi-stage strategy that incorporates domain knowledge and gradually enhances IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. The experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. In the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline. AD-Copilot also exhibits excellent generalization of its performance gains across other specialized and general-purpose benchmarks. Remarkably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AD-Copilot, a specialized multimodal large language model for industrial anomaly detection (IAD) that performs visual in-context comparison. It proposes a novel data curation pipeline to extract inspection knowledge from sparsely labeled industrial images and generate the Chat-AD dataset for captioning, VQA, and defect localization tasks. The architecture includes a Comparison Encoder using cross-attention between paired image features, trained via a multi-stage strategy incorporating domain knowledge. A new MMAD-BBox benchmark for bounding-box anomaly localization is introduced. Reported results include 82.3% accuracy on MMAD (outperforming baselines without data leakage), up to 3.35× improvement on MMAD-BBox, generalization across other benchmarks, and surpassing human expert performance on several IAD tasks, with plans to release datasets and models.
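
For context on what bounding-box-based evaluation typically means here: the paper's exact MMAD-BBox protocol is not reproduced on this page, so the IoU threshold and the one-to-one matching rule below are assumptions, not the benchmark's definition:

```python
# Hedged sketch of IoU-based box scoring; MMAD-BBox's actual protocol
# may differ. Boxes are (x1, y1, x2, y2) in pixels.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def localization_accuracy(preds, gts, thresh=0.5):
    # A predicted box counts as a hit if it overlaps the ground-truth
    # defect box above the (assumed) IoU threshold.
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / max(len(gts), 1)

print(localization_accuracy([(10, 10, 50, 50)], [(12, 8, 48, 52)]))  # 1.0
```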

Significance. If the empirical claims hold after verification of data quality and no leakage, the work would advance the application of MLLMs to specialized industrial domains by addressing the domain gap and the limitation of language-only comparison for subtle visual differences. The release of Chat-AD and MMAD-BBox would provide reusable resources for the community. The multi-stage training and visual cross-attention mechanism represent a targeted adaptation that could inform future domain-specific MLLM development.

major comments (2)
  1. [Data Curation Pipeline] The headline results (82.3% MMAD accuracy, 3.35× MMAD-BBox gain, human-expert surpassing) depend entirely on the data curation pipeline producing high-quality samples without significant label noise or domain mismatch. No quantitative validation (noise rate, inter-annotator agreement, or domain-shift metrics) is reported for the mined Chat-AD data, leaving the foundation of all gains unverified.
  2. [Experiments] The claim of 'without any data leakage' for the 82.3% MMAD accuracy and generalization results is load-bearing but unsupported by details on train/test split construction or leakage verification for the mined industrial images.
minor comments (2)
  1. [Abstract] The abstract asserts 'excellent generalization' across benchmarks but omits specific accuracy numbers or tables summarizing those results, reducing clarity for readers.
  2. [Method] Notation for the Comparison Encoder's cross-attention (e.g., how paired features are fused before the language model) could be formalized with an equation to aid reproducibility.
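
One plausible formalization of the requested equation, using standard cross-attention notation rather than the authors' verified fusion: with query-image patch tokens $Z_q \in \mathbb{R}^{N \times d}$ and normal-template tokens $Z_t \in \mathbb{R}^{M \times d}$,

$$\mathrm{CompEnc}(Z_q, Z_t) = Z_q + \mathrm{softmax}\!\left(\frac{(Z_q W_Q)(Z_t W_K)^{\top}}{\sqrt{d}}\right) Z_t W_V,$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learned projections and the output tokens are passed to the language model alongside the base visual tokens. Whether the paper adds multi-head structure, gating, or a different residual path is exactly what the requested equation should pin down.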

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and will revise the manuscript to provide the requested clarifications and additional validation details.

Point-by-point responses
  1. Referee: [Data Curation Pipeline] The headline results (82.3% MMAD accuracy, 3.35× MMAD-BBox gain, human-expert surpassing) depend entirely on the data curation pipeline producing high-quality samples without significant label noise or domain mismatch. No quantitative validation (noise rate, inter-annotator agreement, or domain-shift metrics) is reported for the mined Chat-AD data, leaving the foundation of all gains unverified.

    Authors: We agree that explicit quantitative validation strengthens the claims. The manuscript describes the curation pipeline but does not report noise rates, inter-annotator agreement, or domain-shift metrics. In the revision we will add these: (i) noise rate estimates from manual inspection of 500 randomly sampled instances, (ii) inter-annotator agreement on a subset of VQA and localization annotations, and (iii) domain-shift metrics (e.g., Fréchet distance on CLIP features) between Chat-AD and the MMAD test distribution. revision: yes

  2. Referee: [Experiments] The claim of 'without any data leakage' for the 82.3% MMAD accuracy and generalization results is load-bearing but unsupported by details on train/test split construction or leakage verification for the mined industrial images.

    Authors: We acknowledge that the current text states the no-leakage claim without sufficient procedural detail. The splits were constructed by source-level partitioning and hash-based deduplication to ensure no image or caption overlap. In the revision we will add a dedicated paragraph (and appendix table) that specifies the exact partitioning rules, deduplication method, and verification steps performed to confirm zero leakage between Chat-AD training data and all reported test sets. revision: yes
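
A minimal sketch of the kind of hash-based leakage check the response describes; the paths are hypothetical, and the authors' pipeline may also apply source-level partitioning rules and perceptual hashes beyond this exact-duplicate test:

```python
# Hedged sketch of hash-based leakage verification between training
# images and benchmark test images. Exact SHA-256 file hashes catch
# byte-identical duplicates only.
import hashlib
from pathlib import Path

def file_hashes(root: str) -> set[str]:
    return {
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(root).rglob("*.png")
    }

def leaked_images(train_dir: str, test_dir: str) -> set[str]:
    # A non-empty intersection flags images shared between the training
    # pool and a benchmark test set.
    return file_hashes(train_dir) & file_hashes(test_dir)

# Usage with hypothetical paths:
# assert not leaked_images("chat_ad/images", "mmad/test_images")
```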

Circularity Check

0 steps flagged

No circularity: results rest on external empirical benchmarks

Full rationale

The paper presents an empirical system: a data-curation pipeline that produces the Chat-AD dataset, a Comparison Encoder using cross-attention, and multi-stage training. All headline numbers (82.3% MMAD accuracy, 3.35× MMAD-BBox gain, human-expert surpassing) are obtained by evaluating the trained model on held-out external benchmarks (MMAD, MMAD-BBox, and others) with an explicit "no data leakage" claim. No equation, parameter, or prediction is defined in terms of itself; the derivation chain consists of standard supervised training followed by independent test-set measurement. The central claims therefore do not reduce to tautology, nor do they rest on load-bearing self-citation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard machine learning assumptions about transfer learning from general MLLMs and the utility of cross-attention for multi-image comparison; no new physical entities or ad-hoc constants are introduced beyond typical training hyperparameters.

free parameters (1)
  • Multi-stage training hyperparameters
    The multi-stage training strategy incorporates domain knowledge and likely requires tuned learning rates, loss weights, and stage durations fitted during development; a purely illustrative schedule is sketched after this ledger.
axioms (1)
  • Domain assumption: Base MLLMs can acquire fine-grained industrial anomaly detection skills through supervised fine-tuning on curated multimodal data
    Invoked when stating that the model is trained with a multi-stage strategy to enhance IAD skills.
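
One hypothetical shape for the stage schedule named in the free-parameter entry above; every name and value below is invented to make the ledger concrete, and none comes from the paper:

```python
# Hypothetical multi-stage schedule; all names and values are invented
# for illustration. Nothing here is taken from the paper.
STAGES = [
    {"name": "domain_captioning", "data": "chat_ad_captions",
     "lr": 1e-5, "epochs": 1, "trainable": ["projector"]},
    {"name": "iad_vqa", "data": "chat_ad_vqa",
     "lr": 5e-6, "epochs": 2, "trainable": ["projector", "comparison_encoder"]},
    {"name": "defect_localization", "data": "chat_ad_bbox",
     "lr": 2e-6, "epochs": 1,
     "trainable": ["projector", "comparison_encoder", "llm_adapter"]},
]
```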

pith-pipeline@v0.9.0 · 5643 in / 1407 out tokens · 52037 ms · 2026-05-15T11:50:39.508880+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    IAD-Unify unifies industrial anomaly segmentation, region-grounded language understanding, and mask-guided generation in one framework using DINOv2 token injection into Qwen3.5, supported by the new Anomaly-56K datase...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    On domain-adaptive post-training for multimodal large language models

    D. Cheng, S. Huang, Z. Zhu, X. Zhang, W. X. Zhao, Z. Luan, B. Dai, and Z. Zhang, "On domain-adaptive post-training for multimodal large language models," arXiv preprint arXiv:2411.19930, 2024.

  2. [2]

    Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

    W. Xu, H. P. Chan, L. Li, M. Aljunied, R. Yuan, J. Wang, C. Xiao, G. Chen, C. Liu, Z. Li et al., "Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning," arXiv preprint arXiv:2506.07044, 2025.

  3. [3]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.

  4. [4]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao et al., "InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency," arXiv preprint arXiv:2508.18265, 2025.

  5. [5]

    Seed1.5-VL Technical Report

    D. Guo, F. Wu, F. Zhu, F. Leng, G. Shi, H. Chen, H. Fan, J. Wang, J. Jiang, J. Wang et al., "Seed1.5-VL technical report," arXiv preprint arXiv:2505.07062, 2025.

  6. [6]

    A survey of visual sensory anomaly detection

    X. Jiang, G. Xie, J. Wang, Y. Liu, C. Wang, F. Zheng, and Y. Jin, "A survey of visual sensory anomaly detection," arXiv preprint arXiv:2202.07006, 2022.

  7. [7]

    SoftPatch+: Fully unsupervised anomaly classification and segmentation

    C. Wang, X. Jiang, B.-B. Gao, Z. Gan, Y. Liu, F. Zheng, and L. Ma, "SoftPatch+: Fully unsupervised anomaly classification and segmentation," Pattern Recognition, vol. 161, p. 111295, 2025.

  8. [8]

    MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection

    X. Jiang, J. Li, H. Deng, Y. Liu, B.-B. Gao, Y. Zhou, J. Li, C. Wang, and F. Zheng, "MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection," in The Thirteenth International Conference on Learning Representations, 2025.

  9. [9]

    Efficient multimodal large language models: A survey

    Y. Jin, J. Li, Y. Liu, T. Gu, K. Wu, Z. Jiang, M. He, B. Zhao, X. Tan, Z. Gan et al., "Efficient multimodal large language models: A survey," arXiv preprint arXiv:2405.10739, 2024.

  10. [10]

    ADer: A comprehensive benchmark for multi-class visual anomaly detection

    J. Zhang, H. He, Z. Gan, Q. He, Y. Cai, Z. Xue, Y. Wang, C. Wang, L. Xie, and Y. Liu, "ADer: A comprehensive benchmark for multi-class visual anomaly detection," arXiv preprint arXiv:2406.03262, vol. 2, no. 3, 2024.

  11. [11]

    Can multimodal large language models be guided to improve industrial anomaly detection?

    Z. Chen, H. Chen, M. Imani, and F. Imani, "Can multimodal large language models be guided to improve industrial anomaly detection?" arXiv preprint arXiv:2501.15795, 2025.

  12. [12]

    PromptAD: Learning prompts with only normal samples for few-shot anomaly detection

    X. Li, Z. Zhang, X. Tan, C. Chen, Y. Qu, Y. Xie, and L. Ma, "PromptAD: Learning prompts with only normal samples for few-shot anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 16838–16848.

  13. [13]

    MANTA: A large-scale multi-view and visual-text anomaly detection dataset for tiny objects

    L. Fan, D. Fan, Z. Hu, Y. Ding, D. Di, K. Yi, M. Pagnucco, and Y. Song, "MANTA: A large-scale multi-view and visual-text anomaly detection dataset for tiny objects," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 25518–25527.

  14. [14]

    Towards zero-shot anomaly detection and reasoning with multimodal large language models

    J. Xu, S.-Y. Lo, B. Safaei, V. M. Patel, and I. Dwivedi, "Towards zero-shot anomaly detection and reasoning with multimodal large language models," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 20370–20382.

  15. [15]

    Triad: Empowering LMM-based anomaly detection with expert-guided region-of-interest tokenizer and manufacturing process

    Y. Li, S. Yuan, H. Wang, Q. Li, M. Liu, C. Xu, G. Shi, and W. Zuo, "Triad: Empowering LMM-based anomaly detection with expert-guided region-of-interest tokenizer and manufacturing process," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 21917–21926.

  16. [16]

    AnomalyGPT: Detecting industrial anomalies using large vision-language models

    Z. Gu, B. Zhu, G. Zhu, Y. Chen, M. Tang, and J. Wang, "AnomalyGPT: Detecting industrial anomalies using large vision-language models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 1932–1940.

  17. [17]

    EIAD: Explainable industrial anomaly detection via multi-modal large language models

    Z. Zhang, J. Ruan, X. Gao, T. Liu, and Y. Fu, "EIAD: Explainable industrial anomaly detection via multi-modal large language models," arXiv preprint arXiv:2503.14162, 2025.

  18. [18]

    MVTec AD: a comprehensive real-world dataset for unsupervised anomaly detection

    P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, "MVTec AD: a comprehensive real-world dataset for unsupervised anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9592–9600.

  19. [19]

    Spot-the-difference self-supervised pre-training for anomaly detection and segmentation

    Y. Zou, J. Jeong, L. Pemula, D. Zhang, and O. Dabeer, "Spot-the-difference self-supervised pre-training for anomaly detection and segmentation," in European Conference on Computer Vision. Springer, 2022, pp. 392–408.

  20. [20]

    OpenAI o1 System Card

    OpenAI, A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney et al., "OpenAI o1 system card," arXiv preprint arXiv:2412.16720, 2024.

  21. [21]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning," Nature, vol. 645, pp. 633–638, 2025. [Online]. Available: https://doi.org/10.1038/s41586-025-09422-z

  22. [22]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models," arXiv preprint arXiv:2402.03300, 2024.

  23. [23]

    AnomalyR1: A GRPO-based end-to-end MLLM for industrial anomaly detection

    Y. Chao, J. Liu, J. Tang, and G. Wu, "AnomalyR1: A GRPO-based end-to-end MLLM for industrial anomaly detection," arXiv preprint arXiv:2504.11914, 2025.

  24. [24]

    OmniAD: Detect and understand industrial anomaly via multimodal reasoning

    S. Zhao, Y. Lin, L. Han, Y. Zhao, and Y. Wei, "OmniAD: Detect and understand industrial anomaly via multimodal reasoning," arXiv preprint arXiv:2505.22039, 2025.

  25. [25]

    EMIT: Enhancing MLLMs for industrial anomaly detection via difficulty-aware GRPO

    W. Guan, J. Lan, J. Cao, H. Tan, H. Zhu, and W. Wang, "EMIT: Enhancing MLLMs for industrial anomaly detection via difficulty-aware GRPO," arXiv preprint arXiv:2507.21619, 2025.

  26. [26]

    AD-FM: Multimodal LLMs for anomaly detection via multi-stage reasoning and fine-grained reward optimization

    J. Liao, Y. Su, R.-C. Tu, Z. Jin, W. Sun, Y. Li, D. Tao, X. Xu, and X. Yang, "AD-FM: Multimodal LLMs for anomaly detection via multi-stage reasoning and fine-grained reward optimization," arXiv preprint arXiv:2508.04175, 2025.

  27. [27]

    End-to-end object detection with transformers

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.

  28. [28]

    Accurate industrial anomaly detection and localization using weakly-supervised residual transformers

    H. Li, J. Wu, D. Liu, L. Y. Wu, H. Chen, and C. Shen, "Accurate industrial anomaly detection and localization using weakly-supervised residual transformers," IEEE Transactions on Image Processing, 2026.

  29. [29]

    UniPCB: A unified vision-language benchmark for open-ended PCB quality inspection

    F. Sun, X. Jiang, J. Wu et al., "UniPCB: A unified vision-language benchmark for open-ended PCB quality inspection," arXiv preprint arXiv:2601.19222, 2026.

  30. [30]

    OmniDiff: A comprehensive benchmark for fine-grained image difference captioning

    Y. Liu, S. Hou, S. Hou, J. Du, S. Meng, and Y. Huang, "OmniDiff: A comprehensive benchmark for fine-grained image difference captioning," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 21440–21449.

  31. [31]

    SoftPatch: Unsupervised anomaly detection with noisy data

    X. Jiang, J. Liu, J. Wang, Q. Nie, K. Wu, Y. Liu, C. Wang, and F. Zheng, "SoftPatch: Unsupervised anomaly detection with noisy data," Advances in Neural Information Processing Systems, vol. 35, pp. 15433–15445, 2022.

  32. [32]

    Anomaly detection via reverse distillation from one-class embedding

    H. Deng and X. Li, "Anomaly detection via reverse distillation from one-class embedding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9737–9746.

  33. [33]

    EfficientAD: Accurate visual anomaly detection at millisecond-level latencies

    K. Batzner, L. Heckler, and R. König, "EfficientAD: Accurate visual anomaly detection at millisecond-level latencies," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 128–138.

  34. [34]

    FocusPatch AD: Few-shot multi-class anomaly detection with unified keywords patch prompts

    X. Ding, X. Li, M. Chen, J. Gong, and Y. Xie, "FocusPatch AD: Few-shot multi-class anomaly detection with unified keywords patch prompts," IEEE Transactions on Image Processing, vol. 35, pp. 112–123, 2025.

  35. [35]

    A unified model for multi-class anomaly detection

    Z. You, L. Cui, Y. Shen, K. Yang, X. Lu, Y. Zheng, and X. Le, "A unified model for multi-class anomaly detection," Advances in Neural Information Processing Systems, vol. 35, pp. 4571–4584, 2022.

  36. [36]

    Toward multi-class anomaly detection: Exploring class-aware unified model against inter-class interference

    X. Jiang, Y. Chen, Q. Nie, J. Liu, Y. Liu, C. Wang, and F. Zheng, "Toward multi-class anomaly detection: Exploring class-aware unified model against inter-class interference," arXiv preprint arXiv:2403.14213, 2024.

  37. [37]

    Normal-abnormal guided generalist anomaly detection

    Y. Wang, X. Wang, Y. Gong, and J. Xiao, "Normal-abnormal guided generalist anomaly detection," arXiv preprint arXiv:2510.00495, 2025.

  38. [38]

    SEAS: Few-shot industrial anomaly image generation with separation and sharing fine-tuning

    Z. Dai, S. Zeng, H. Liu, X. Li, F. Xue, and Y. Zhou, "SEAS: Few-shot industrial anomaly image generation with separation and sharing fine-tuning," arXiv preprint arXiv:2410.14987, 2024.

  39. [39]

    MuSc: Zero-shot industrial anomaly classification and segmentation with mutual scoring of the unlabeled images

    X. Li, Z. Huang, F. Xue, and Y. Zhou, "MuSc: Zero-shot industrial anomaly classification and segmentation with mutual scoring of the unlabeled images," in The Twelfth International Conference on Learning Representations, 2024.

  40. [40]

    WinCLIP: Zero-/few-shot anomaly classification and segmentation

    J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, "WinCLIP: Zero-/few-shot anomaly classification and segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19606–19616.

  41. [41]

    AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection

    Y. Cao, J. Zhang, L. Frittoli, Y. Cheng, W. Shen, and G. Boracchi, "AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection," in European Conference on Computer Vision. Springer, 2024, pp. 55–72.

  42. [42]

    AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection

    Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, "AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection," arXiv preprint arXiv:2310.18961, 2023.

  43. [43]

    MultiADS: Defect-aware supervision for multi-type anomaly detection and segmentation in zero-shot learning

    Y. Sadikaj, H. Zhou, L. Halilaj, S. Schmid, S. Staab, and C. Plant, "MultiADS: Defect-aware supervision for multi-type anomaly detection and segmentation in zero-shot learning," arXiv preprint arXiv:2504.06740, 2025.

  44. [44]

    Towards generic anomaly detection and understanding: Large-scale visual-linguistic model (GPT-4V) takes the lead

    Y. Cao, X. Xu, C. Sun, X. Huang, and W. Shen, "Towards generic anomaly detection and understanding: Large-scale visual-linguistic model (GPT-4V) takes the lead," arXiv preprint arXiv:2311.02782, 2023.

  45. [45]

    Customizing visual-language foundation models for multi-modal anomaly detection and reasoning

    X. Xu, Y. Cao, H. Zhang, N. Sang, and X. Huang, "Customizing visual-language foundation models for multi-modal anomaly detection and reasoning," in 2025 28th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2025, pp. 1443–1448.

  46. [46]

    Do LLMs understand visual anomalies? Uncovering LLM's capabilities in zero-shot anomaly detection

    J. Zhu, S. Cai, F. Deng, B. C. Ooi, and J. Wu, "Do LLMs understand visual anomalies? Uncovering LLM's capabilities in zero-shot anomaly detection," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 48–57.

  47. [47]

    GPT-4V-AD: Exploring grounding potential of VQA-oriented GPT-4V for zero-shot anomaly detection

    J. Zhang, H. He, X. Chen, Z. Xue, Y. Wang, C. Wang, L. Xie, and Y. Liu, "GPT-4V-AD: Exploring grounding potential of VQA-oriented GPT-4V for zero-shot anomaly detection," in International Joint Conference on Artificial Intelligence. Springer, 2024, pp. 3–16.

  48. [48]

    Detect, classify, act: Categorizing industrial anomalies with multi-modal large language models

    S. Mokhtar, A. Mousakhan, S. Galesso, J. Tayyub, and T. Brox, "Detect, classify, act: Categorizing industrial anomalies with multi-modal large language models," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 4058–4067.

  49. [49]

    IAD-R1: Reinforcing consistent reasoning in industrial anomaly detection

    Y. Li, Y. Cao, C. Liu, Y. Xiong, X. Dong, and C. Huang, "IAD-R1: Reinforcing consistent reasoning in industrial anomaly detection," arXiv preprint arXiv:2508.09178, 2025.

  50. [50]

    LR-IAD: Mask-free industrial anomaly detection with logical reasoning

    P. Zeng, F. Pang, Z. Wang, and A. Yang, "LR-IAD: Mask-free industrial anomaly detection with logical reasoning," arXiv preprint arXiv:2504.19524, 2025.

  51. [51]

    A transformer-based Siamese network for change detection

    W. G. C. Bandara and V. M. Patel, "A transformer-based Siamese network for change detection," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2022, pp. 207–210.

  52. [52]

    Robust change captioning

    D. H. Park, T. Darrell, and A. Rohrbach, "Robust change captioning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4624–4633.

  53. [53]

    Learning to describe differences between pairs of similar images

    H. Jhamtani and T. Berg-Kirkpatrick, "Learning to describe differences between pairs of similar images," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.

  54. [54]

    Changes to captions: An attentive network for remote sensing change captioning

    S. Chang and P. Ghamisi, "Changes to captions: An attentive network for remote sensing change captioning," IEEE Transactions on Image Processing, vol. 32, pp. 6047–6060, 2023.

  55. [55]

    MagicBrush: A manually annotated dataset for instruction-guided image editing

    K. Zhang, L. Mo, W. Chen, H. Sun, and Y. Su, "MagicBrush: A manually annotated dataset for instruction-guided image editing," in Advances in Neural Information Processing Systems, 2023.

  56. [56]

    OneDiff: A generalist model for image difference captioning

    E. Hu, L. Guo, T. Yue, Z. Zhao, S. Xue, and J. Liu, "OneDiff: A generalist model for image difference captioning," in Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 2439–2455.

  57. [57]

    Img-Diff: Contrastive data synthesis for multimodal large language models

    Q. Jiao, D. Chen, Y. Huang, B. Ding, Y. Li, and Y. Shen, "Img-Diff: Contrastive data synthesis for multimodal large language models," arXiv preprint arXiv:2408.04594, 2024.

  58. [58]

    COFT-AD: Contrastive fine-tuning for few-shot anomaly detection

    J. Liao, X. Xu, M. C. Nguyen, A. Goodge, and C. S. Foo, "COFT-AD: Contrastive fine-tuning for few-shot anomaly detection," IEEE Transactions on Image Processing, vol. 33, pp. 2090–2103, 2024.

  59. [59]

    A multi-category anomaly editing network with correlation exploration and voxel-level attention for unsupervised surface anomaly detection

    R. Zhang and H.-M. Hu, "A multi-category anomaly editing network with correlation exploration and voxel-level attention for unsupervised surface anomaly detection," IEEE Transactions on Image Processing, 2025.

  60. [60]

    AD3: Introducing a score for anomaly detection dataset difficulty assessment using viaduct dataset

    J. Lehr, J. Philipps, A. Sargsyan, M. Pape, and J. Krüger, "AD3: Introducing a score for anomaly detection dataset difficulty assessment using viaduct dataset," in European Conference on Computer Vision. Springer, 2024, pp. 449–464.

  61. [61]

    Real-IAD: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection

    C. Wang, W. Zhu, B.-B. Gao, Z. Gan, J. Zhang, Z. Gu, S. Qian, M. Chen, and L. Ma, "Real-IAD: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22883–22892.

  62. [62]

    3CAD: A large-scale real-world 3C product dataset for unsupervised anomaly detection

    E. Yang, P. Xing, H. Sun, W. Guo, Y. Ma, Z. Li, and D. Zeng, "3CAD: A large-scale real-world 3C product dataset for unsupervised anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9175–9183.

  63. [63]

    Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions

    S. Jezek, M. Jonak, R. Burget, P. Dvorak, and M. Skotak, "Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions," in 2021 13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT). IEEE, 2021, pp. 66–71.

  64. [64]

    Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization

    P. Bergmann, K. Batzner, M. Fauser, D. Sattlegger, and C. Steger, "Beyond dents and scratches: Logical constraints in unsupervised anomaly detection and localization," International Journal of Computer Vision, vol. 130, no. 4, pp. 947–969, 2022.

  65. [65]

    PKU-GoodsAD: A supermarket goods dataset for unsupervised anomaly detection and segmentation

    J. Zhang, R. Ding, M. Ban, and L. Dai, "PKU-GoodsAD: A supermarket goods dataset for unsupervised anomaly detection and segmentation," IEEE Robotics and Automation Letters, 2024.

  66. [66]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.

  67. [67]

    LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day

    C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, "LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day," Advances in Neural Information Processing Systems, vol. 36, pp. 28541–28564, 2023.

  68. [68]

    2 OLMo 2 Furious

    T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan et al., "2 OLMo 2 Furious," arXiv preprint arXiv:2501.00656, 2024.

  69. [69]

    Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models

    M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S. Park, M. Salehi, N. Muennighoff, K. Lo, L. Soldaini et al., "Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 91–104.

  70. [70]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li, "LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models," arXiv preprint arXiv:2407.07895, 2024.

  71. [71]

    SFT memorizes, RL generalizes: A comparative study of foundation model post-training

    T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma, "SFT memorizes, RL generalizes: A comparative study of foundation model post-training," in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=dYur3yabMj

  72. [72]

    Detect anything via next point prediction

    Q. Jiang, J. Huo, X. Chen, Y. Xiong, Z. Zeng, Y. Chen, T. Ren, J. Yu, and L. Zhang, "Detect anything via next point prediction," arXiv preprint arXiv:2510.12798, 2025.

  73. [73]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma, "LlamaFactory: Unified efficient fine-tuning of 100+ language models," arXiv preprint arXiv:2403.13372, 2024.

  74. [74]

    Perception-R1: Pioneering perception policy with reinforcement learning

    E. Yu, K. Lin, L. Zhao, J. Yin, Y. Wei, Y. Peng, H. Wei, J. Sun, C. Han, Z. Ge, X. Zhang, D. Jiang, J. Wang, and W. Tao, "Perception-R1: Pioneering perception policy with reinforcement learning," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=BeXcXrXetA

  75. [75]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-B. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser et al., "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv preprint arXiv:2403.05530, 2024.

  76. [76]

    Qwen3-VL: Sharper vision, deeper thought, broader action

    Qwen Team, "Qwen3-VL: Sharper vision, deeper thought, broader action," Qwen Blog, 2025, accessed 2025-10-04.

  77. [77]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai, "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks," arXiv preprint arXiv:2312.14238, 2023.