pith. machine review for the scientific record. sign in

arxiv: 2605.10833 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: no theorem link

MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

Jing Jin, Tao Feng, Xiran Zhao, Xuanyu Zhu, Yan Bai, Yifeng Sun, Yihang Lou, Yingna Wu, Zhongan Wang

Pith reviewed 2026-05-12 05:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-view videoindustrial anomaly detectionmulti-task evaluationdefect classificationtemporal localizationvideo MLLMsupervised fine-tuningpolicy optimization
0
0 comments X

The pith

A new multi-view video dataset for industrial inspection combined with structured post-training raises average model performance on anomaly tasks from 45.0 to 57.5 on unseen data and exceeds GPT-5.4.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMVIAD as the first dataset of continuous multi-view videos for industrial anomaly detection, consisting of short object-centric clips with camera motion across 48 categories, 14 environments, and 6 anomaly types. It supports four tasks: anomaly detection, defect classification, object classification, and anomaly visible-time localization. Existing video models fall short of human performance on fine-grained recognition and temporal grounding. The authors then apply a two-stage pipeline that first uses supervised fine-tuning to build perception-structured reasoning and then applies group relative policy optimization with rewards grounded in defect semantics and visibility timing. This produces the VISTA model, which delivers the reported gains on data the base model has not encountered before.

Core claim

We introduce MMVIAD, the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. Systematic evaluations show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. We develop a two-stage post-training pipeline where PS-SFT initializes perception-structured reasoning and VISTA-GRPO refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to

What carries the argument

The two-stage post-training pipeline of Perception-Structured Supervised Fine-Tuning (PS-SFT) followed by Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization (VISTA-GRPO), which applies semantic-gated defect rewards and visibility-aware temporal rewards to refine anomaly understanding.

Load-bearing premise

The reported performance gains reflect genuine improvements in anomaly understanding rather than the model fitting to the particular statistics or annotation patterns of the new MMVIAD dataset.

What would settle it

An independent collection of multi-view industrial video clips recorded under similar conditions but outside the MMVIAD distribution, on which VISTA either fails to maintain the 12.5-point average gain or drops back to the base model's 45.0 score, would falsify the claim of transferable improvement.

Figures

Figures reproduced from arXiv: 2605.10833 by Jing Jin, Tao Feng, Xiran Zhao, Xuanyu Zhu, Yan Bai, Yifeng Sun, Yihang Lou, Yingna Wu, Zhongan Wang.

Figure 1
Figure 1. Figure 1: Comparison of anomaly detection paradigms. From single-view and sparse multi-view detection to continuous multi-view video anomaly detection. To address this gap, we introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), a visibility-grounded video dataset and benchmark for continuous multi-view, multi-task industrial anomaly understanding. To the best of our knowledge, MMVIAD is the … view at source ↗
Figure 2
Figure 2. Figure 2: MMVIAD data construction. We render continuous multi-view videos from Anomaly￾ShapeNet point clouds, derive visible-time labels by comparing aligned marked and unmarked videos, and convert each annotated clip into four QA tasks for anomaly detection, defect classification, object classification, and visible-time localization [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: VISTA-GRPO reward modeling. The reward design combines format checking, answer correctness, semantic-gated defect reward, and visibility-aware temporal reward. 3.2.1 PS-SFT: Perception-Structured Supervised Fine-Tuning Directly applying reinforcement learning to the base model is unstable, as unreliable structured outputs lead to sparse rewards and hinder early exploration. To mitigate this cold-start prob… view at source ↗
Figure 4
Figure 4. Figure 4: Task-wise reward dynamics during VISTA-GRPO training. Rewards improve most clearly on anomaly detection and anomaly visible-time localization, while defect classification remains sparse and unstable, highlighting the core challenge of fine-grained industrial anomaly understanding. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: evaluates synthetic-to-real generalization on real industrial image datasets. VISTA trained on synthetic MMVIAD-Unseen improves the Qwen3-VL-8B base model from 43.8 to 67.1 average accuracy on MVTec AD and from 53.0 to 54.4 on VisA. The gain is especially large on MVTec AD, where most categories benefit from the MMVIAD-trained model, suggesting that anomaly understanding learned from synthetic multi-view v… view at source ↗
Figure 6
Figure 6. Figure 6: shows a qualitative example of VISTA. The model separates global object perception from localized defect evidence and grounds the defect prediction to the visible-time interval. This illustrates that MMVIAD evaluates not only final answer correctness, but also evidence-grounded temporal understanding [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Representative semantic object groups in MMVIAD. We visualize representative objects from the 17 semantic groups used in MMVIAD. Objects are rendered with a neutral material to emphasize geometric and semantic diversity. MMVIAD contains 48 fine-grained object categories, which are organized into 17 semantic groups [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Representative material groups in MMVIAD. MMVIAD uses six major material groups, including plastic, fabric, clay, wood, leather, and metal, with 28 fine-grained material subclasses in total. These materials introduce diverse surface appearance and reflectance while preserving the object-centric inspection setting. C.3 Structural Anomaly Types MMVIAD contains six structural anomaly types: crack, scratch, co… view at source ↗
Figure 9
Figure 9. Figure 9: Representative structural anomaly types in MMVIAD. We visualize six structural anomaly types, including crack, scratch, concavity, bulge, broken, and hole. The highlighted boxes indicate defect regions on representative objects. D Reasoning Examples [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Reasoning example: tap with bulge. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Reasoning example: vase with concavity [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Reasoning example: helmet with hole. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Reasoning example: microphone with scratch. These examples show that VISTA does not only output final answers, but also grounds them in explicit visual evidence. The model links object-level cues to category prediction, defect morphology to anomaly type, and viewpoint-dependent visibility to temporal localization. This structured reasoning format makes the four task predictions interpretable and exposes w… view at source ↗
Figure 14
Figure 14. Figure 14: PS-SFT training curves. We report loss, token accuracy, learning rate, and gradient norm during perception-structured supervised fine-tuning [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: shows the VISTA-GRPO training curves, including KL divergence, gradient norm, entropy, and completion length [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗
read the original abstract

Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MMVIAD, a new multi-view multi-task video dataset for industrial anomaly detection consisting of 2-second 120° object-centric clips across 48 categories, 14 environments, and 6 anomaly types. It supports four tasks (anomaly detection, defect classification, object classification, anomaly visible-time localization) and provides a benchmark showing current video MLLMs lag human performance. The authors propose a two-stage post-training pipeline—PS-SFT for perception-structured reasoning followed by VISTA-GRPO with semantic-gated defect and visibility-aware temporal rewards—yielding the VISTA model. On the MMVIAD-Unseen split, VISTA raises the base model's average multi-task score from 45.0 to 57.5 and surpasses GPT-5.4; source code is released.

Significance. If the reported gains reflect genuine generalization rather than dataset-specific overfitting, MMVIAD would be a valuable addition to industrial anomaly detection resources by shifting focus from static images to continuous multi-view video inspection. The two-stage pipeline and reward design offer a concrete approach to improving structured reasoning in video MLLMs for this domain. The public code release supports reproducibility.

major comments (3)
  1. [§4, §5.2] §4 (Dataset Construction) and §5.2 (Evaluation Protocol): The construction of the MMVIAD-Unseen split is not specified—e.g., whether entire object categories, environments, or anomaly types are held out versus a random video-level split. Without this, the claim that the 57.5 average score demonstrates transferable anomaly understanding (rather than exploitation of MMVIAD's particular statistics) cannot be evaluated.
  2. [§5.3] §5.3 (VISTA-GRPO): The semantic-gated defect reward and visibility-aware temporal reward are derived directly from the same human annotations used to create MMVIAD. The paper must demonstrate that these rewards do not introduce leakage or allow the policy to fit annotation artifacts specific to the 48-category, 14-environment collection rather than learning robust perception.
  3. [§5.1, Table 2] §5.1 and Table 2: No information is provided on baseline implementation details, statistical significance testing, or controls for data leakage in the reported numbers (base model 45.0 → VISTA 57.5). This makes it impossible to verify that the improvement over GPT-5.4 is robust.
minor comments (2)
  1. [Abstract, §1] The abstract and §1 cite 'human performance' as an upper bound but provide no details on how human baselines were collected or measured.
  2. [§3, §5] Notation for the four tasks and the exact averaging procedure for the 'average score' should be defined explicitly in §3 or §5.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback, which has helped us identify areas where additional clarity and rigor are needed. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of the dataset splits, reward design, and evaluation protocols.

read point-by-point responses
  1. Referee: [§4, §5.2] §4 (Dataset Construction) and §5.2 (Evaluation Protocol): The construction of the MMVIAD-Unseen split is not specified—e.g., whether entire object categories, environments, or anomaly types are held out versus a random video-level split. Without this, the claim that the 57.5 average score demonstrates transferable anomaly understanding (rather than exploitation of MMVIAD's particular statistics) cannot be evaluated.

    Authors: We agree that the original description of the MMVIAD-Unseen split in §4 was insufficiently detailed. The split holds out 12 complete object categories (25% of the 48 total) that do not appear in the training or validation sets, while environments and anomaly types have partial overlap but with no video-level leakage from the held-out categories. We have expanded §4 with a new table (Table 1) listing the exact held-out categories, the distribution of environments across splits, and a diagram illustrating the category-level hold-out strategy. This makes explicit that the 57.5 score reflects generalization to unseen objects rather than memorization of dataset-specific statistics. revision: yes

  2. Referee: [§5.3] §5.3 (VISTA-GRPO): The semantic-gated defect reward and visibility-aware temporal reward are derived directly from the same human annotations used to create MMVIAD. The paper must demonstrate that these rewards do not introduce leakage or allow the policy to fit annotation artifacts specific to the 48-category, 14-environment collection rather than learning robust perception.

    Authors: The rewards are computed exclusively from the training-split annotations and never access test or Unseen annotations during GRPO. To address potential fitting to collection-specific artifacts, we have added an ablation study in the revised §5.3 showing that ablating the semantic gate drops Unseen performance by 4.2 points while the visibility-aware temporal reward contributes an additional 3.1 points; both components improve generalization metrics on held-out categories. We have also clarified that the reward functions use only coarse category-level and visibility labels rather than fine-grained per-video artifacts, and we include a new paragraph discussing why this design encourages robust perception rather than overfitting. revision: partial

  3. Referee: [§5.1, Table 2] §5.1 and Table 2: No information is provided on baseline implementation details, statistical significance testing, or controls for data leakage in the reported numbers (base model 45.0 → VISTA 57.5). This makes it impossible to verify that the improvement over GPT-5.4 is robust.

    Authors: We acknowledge the omission of these details. In the revised manuscript we have expanded §5.1 with full prompting templates and decoding parameters used for all baselines (including GPT-5.4), added standard deviations computed over three independent runs for every reported score, and included a paired t-test (p < 0.01) confirming the 12.5-point average improvement is statistically significant. We have also added an explicit statement and appendix section verifying that no MMVIAD-Unseen videos appear in any baseline pre-training corpora or fine-tuning data, thereby ruling out leakage. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results on newly introduced dataset

full rationale

The paper introduces the MMVIAD dataset and reports empirical performance numbers for a two-stage training pipeline (PS-SFT followed by VISTA-GRPO) on the MMVIAD-Unseen split. No equations, self-definitional quantities, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The central result (score lift from 45.0 to 57.5) is a measured outcome on held-out data rather than a quantity that reduces to its own inputs by construction. The derivation chain is therefore self-contained as standard supervised fine-tuning plus RL on a new benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the central claim rests on the empirical utility of the new dataset and the effectiveness of the proposed training stages.

pith-pipeline@v0.9.0 · 5593 in / 1164 out tokens · 66385 ms · 2026-05-12T05:03:31.769433+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 5 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. volume 35, pages 23716–23736, 2022

  2. [2]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. InProceedings of the IEEE international conference on computer vision, pages 5803–5812, 2017

  3. [3]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    Efficientad: Accurate visual anomaly detection at millisecond-level latencies

    Kilian Batzner, Lars Heckler, and Rebecca König. Efficientad: Accurate visual anomaly detection at millisecond-level latencies. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 128–138, 2024

  6. [6]

    Mvtec ad–a comprehen- sive real-world dataset for unsupervised anomaly detection

    Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehen- sive real-world dataset for unsupervised anomaly detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9592–9600, 2019

  7. [7]

    The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization

    Paul Bergmann, Xin Jin, David Sattlegger, and Carsten Steger. The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization. 2021

  8. [8]

    Anomalyr1: A grpo-based end-to-end mllm for industrial anomaly detection.arXiv preprint arXiv:2504.11914, 2025

    Yuhao Chao, Jie Liu, Jie Tang, and Gangshan Wu. Anomalyr1: A grpo-based end-to-end mllm for industrial anomaly detection.arXiv preprint arXiv:2504.11914, 2025. 10

  9. [9]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

  10. [10]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. volume 36, pages 49250–49267, 2023

  11. [11]

    Padim: a patch distribution modeling framework for anomaly detection and localization

    Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: a patch distribution modeling framework for anomaly detection and localization. InInternational conference on pattern recognition, pages 475–489. Springer, 2021

  12. [12]

    Anomaly detection via reverse distillation from one-class embed- ding

    Hanqiu Deng and Xingyu Li. Anomaly detection via reverse distillation from one-class embed- ding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9737–9746, 2022

  13. [13]

    Video-r1: Reinforcing video reasoning in mllms

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. 2025

  14. [14]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017

  15. [15]

    Anoma- lygpt: Detecting industrial anomalies using large vision-language models

    Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anoma- lygpt: Detecting industrial anomalies using large vision-language models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 1932–1940, 2024

  16. [16]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  17. [17]

    Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding

    Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, and Kevin Zhao. Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3302–3310, 2025

  18. [18]

    Trace: Temporal grounding video llm via causal event modeling

    Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. Trace: Temporal grounding video llm via causal event modeling. 2024

  19. [19]

    Deep learning- based defect detection of metal parts: evaluating current methods in complex conditions

    Stepan Jezek, Martin Jonak, Radim Burget, Pavel Dvorak, and Milos Skotak. Deep learning- based defect detection of metal parts: evaluating current methods in complex conditions. In 2021 13th International congress on ultra modern telecommunications and control systems and workshops (ICUMT), pages 66–71. IEEE, 2021

  20. [20]

    Mmad: A comprehensive benchmark for multimodal large language models in industrial anomaly detection

    Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. Mmad: A comprehensive benchmark for multimodal large language models in industrial anomaly detection. 2024

  21. [21]

    Detecting moments and highlights in videos via natural language queries

    Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. volume 34, pages 11846–11858, 2021

  22. [22]

    Cutpaste: Self-supervised learning for anomaly detection and localization

    Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. Cutpaste: Self-supervised learning for anomaly detection and localization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9664–9674, 2021

  23. [23]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  24. [24]

    Towards visual discrimination and reasoning of real-world physical dynamics: Physics-grounded anomaly detection

    Wenqiao Li, Yao Gu, Xintao Chen, Xiaohao Xu, Ming Hu, Xiaonan Huang, and Yingna Wu. Towards visual discrimination and reasoning of real-world physical dynamics: Physics-grounded anomaly detection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30409–30419, 2025. 11

  25. [25]

    Towards scalable 3d anomaly detection and localization: A benchmark via 3d anomaly synthesis and a self-supervised learning network

    Wenqiao Li, Xiaohao Xu, Yao Gu, Bozhong Zheng, Shenghua Gao, and Yingna Wu. Towards scalable 3d anomaly detection and localization: A benchmark via 3d anomaly synthesis and a self-supervised learning network. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22207–22216, 2024

  26. [26]

    arXiv preprint arXiv:2504.06958 (2025)

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforce- ment fine-tuning.arXiv preprint arXiv:2504.06958, 2025

  27. [27]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

  28. [28]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. volume 36, pages 34892–34916, 2023

  29. [29]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

  30. [30]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. volume 35, pages 27730–27744, 2022

  31. [31]

    Chatvtg: Video temporal grounding via chat with video dialogue large language models

    Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. Chatvtg: Video temporal grounding via chat with video dialogue large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024

  32. [32]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. volume 36, pages 53728–53741, 2023

  33. [33]

    Timechat: A time-sensitive multimodal large language model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024

  34. [34]

    Towards total recall in industrial anomaly detection

    Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14318–14328, 2022

  35. [35]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  36. [36]

    Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection

    Chengjie Wang, Wenbing Zhu, Bin-Bin Gao, Zhenye Gan, Jiangning Zhang, Zhihao Gu, Shuguang Qian, Mingang Chen, and Lizhuang Ma. Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22883–22892, 2024

  37. [37]

    Time-r1: Post-training large vision language model for temporal video grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding. 2025

  38. [38]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. volume 35, pages 24824–24837, 2022

  39. [39]

    Videochat-r1

    Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1. 5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. 2025

  40. [40]

    Draem-a discriminatively trained re- construction embedding for surface anomaly detection

    Vitjan Zavrtanik, Matej Kristan, and Danijel Sko ˇcaj. Draem-a discriminatively trained re- construction embedding for surface anomaly detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 8330–8339, 2021. 12

  41. [41]

    Video-llama: An instruction-tuned audio-visual language model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. InProceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, pages 543–553, 2023

  42. [42]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024

  43. [43]

    Learning 2d temporal adjacent networks for moment localization with natural language

    Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2d temporal adjacent networks for moment localization with natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 12870–12877, 2020

  44. [44]

    Rad: A dataset and benchmark for real-life anomaly detection with robotic observations.arXiv preprint arXiv:2410.00713, 2024

    Kaichen Zhou, Xinhai Chang, Taewhan Kim, Jiadong Zhang, Yang Cao, Chufei Peng, Fangneng Zhan, Hao Zhao, Hao Dong, Kai Ming Ting, et al. Rad: A dataset and benchmark for real-life anomaly detection with robotic observations.arXiv preprint arXiv:2410.00713, 2024

  45. [45]

    Pad: A dataset and benchmark for pose-agnostic anomaly detection.Advances in Neural Information Processing Systems, 36:44558–44571, 2023

    Qiang Zhou, Weize Li, Lihan Jiang, Guoliang Wang, Guyue Zhou, Shanghang Zhang, and Hao Zhao. Pad: A dataset and benchmark for pose-agnostic anomaly detection.Advances in Neural Information Processing Systems, 36:44558–44571, 2023

  46. [46]

    Minigpt-4: Enhanc- ing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhanc- ing vision-language understanding with advanced large language models. 2023

  47. [47]

    w/oRsg” removes the semantic gate for defect classification, and “w/o Rvis

    Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the- difference self-supervised pre-training for anomaly detection and segmentation. InEuropean conference on computer vision, pages 392–408. Springer, 2022. 13 A Broader Impact A.1 Potential Positive Impacts Industrial anomaly detection is important for manufacturing quality c...