arxiv: 2605.10833 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: no theorem link

MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

Jing Jin, Tao Feng, Xiran Zhao, Xuanyu Zhu, Yan Bai, Yifeng Sun, Yihang Lou, Yingna Wu, Zhongan Wang

Pith reviewed 2026-05-12 05:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multi-view videoindustrial anomaly detectionmulti-task evaluationdefect classificationtemporal localizationvideo MLLMsupervised fine-tuningpolicy optimization

0 comments

The pith

A new multi-view video dataset for industrial inspection combined with structured post-training raises average model performance on anomaly tasks from 45.0 to 57.5 on unseen data and exceeds GPT-5.4.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMVIAD as the first dataset of continuous multi-view videos for industrial anomaly detection, consisting of short object-centric clips with camera motion across 48 categories, 14 environments, and 6 anomaly types. It supports four tasks: anomaly detection, defect classification, object classification, and anomaly visible-time localization. Existing video models fall short of human performance on fine-grained recognition and temporal grounding. The authors then apply a two-stage pipeline that first uses supervised fine-tuning to build perception-structured reasoning and then applies group relative policy optimization with rewards grounded in defect semantics and visibility timing. This produces the VISTA model, which delivers the reported gains on data the base model has not encountered before.

Core claim

We introduce MMVIAD, the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. Systematic evaluations show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. We develop a two-stage post-training pipeline where PS-SFT initializes perception-structured reasoning and VISTA-GRPO refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to

What carries the argument

The two-stage post-training pipeline of Perception-Structured Supervised Fine-Tuning (PS-SFT) followed by Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization (VISTA-GRPO), which applies semantic-gated defect rewards and visibility-aware temporal rewards to refine anomaly understanding.

Load-bearing premise

The reported performance gains reflect genuine improvements in anomaly understanding rather than the model fitting to the particular statistics or annotation patterns of the new MMVIAD dataset.

What would settle it

An independent collection of multi-view industrial video clips recorded under similar conditions but outside the MMVIAD distribution, on which VISTA either fails to maintain the 12.5-point average gain or drops back to the base model's 45.0 score, would falsify the claim of transferable improvement.

Figures

Figures reproduced from arXiv: 2605.10833 by Jing Jin, Tao Feng, Xiran Zhao, Xuanyu Zhu, Yan Bai, Yifeng Sun, Yihang Lou, Yingna Wu, Zhongan Wang.

**Figure 1.** Figure 1: Comparison of anomaly detection paradigms. From single-view and sparse multi-view detection to continuous multi-view video anomaly detection. To address this gap, we introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), a visibility-grounded video dataset and benchmark for continuous multi-view, multi-task industrial anomaly understanding. To the best of our knowledge, MMVIAD is the … view at source ↗

**Figure 2.** Figure 2: MMVIAD data construction. We render continuous multi-view videos from AnomalyShapeNet point clouds, derive visible-time labels by comparing aligned marked and unmarked videos, and convert each annotated clip into four QA tasks for anomaly detection, defect classification, object classification, and visible-time localization [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: VISTA-GRPO reward modeling. The reward design combines format checking, answer correctness, semantic-gated defect reward, and visibility-aware temporal reward. 3.2.1 PS-SFT: Perception-Structured Supervised Fine-Tuning Directly applying reinforcement learning to the base model is unstable, as unreliable structured outputs lead to sparse rewards and hinder early exploration. To mitigate this cold-start prob… view at source ↗

**Figure 4.** Figure 4: Task-wise reward dynamics during VISTA-GRPO training. Rewards improve most clearly on anomaly detection and anomaly visible-time localization, while defect classification remains sparse and unstable, highlighting the core challenge of fine-grained industrial anomaly understanding. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: evaluates synthetic-to-real generalization on real industrial image datasets. VISTA trained on synthetic MMVIAD-Unseen improves the Qwen3-VL-8B base model from 43.8 to 67.1 average accuracy on MVTec AD and from 53.0 to 54.4 on VisA. The gain is especially large on MVTec AD, where most categories benefit from the MMVIAD-trained model, suggesting that anomaly understanding learned from synthetic multi-view v… view at source ↗

**Figure 6.** Figure 6: shows a qualitative example of VISTA. The model separates global object perception from localized defect evidence and grounds the defect prediction to the visible-time interval. This illustrates that MMVIAD evaluates not only final answer correctness, but also evidence-grounded temporal understanding [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Representative semantic object groups in MMVIAD. We visualize representative objects from the 17 semantic groups used in MMVIAD. Objects are rendered with a neutral material to emphasize geometric and semantic diversity. MMVIAD contains 48 fine-grained object categories, which are organized into 17 semantic groups [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Representative material groups in MMVIAD. MMVIAD uses six major material groups, including plastic, fabric, clay, wood, leather, and metal, with 28 fine-grained material subclasses in total. These materials introduce diverse surface appearance and reflectance while preserving the object-centric inspection setting. C.3 Structural Anomaly Types MMVIAD contains six structural anomaly types: crack, scratch, co… view at source ↗

**Figure 9.** Figure 9: Representative structural anomaly types in MMVIAD. We visualize six structural anomaly types, including crack, scratch, concavity, bulge, broken, and hole. The highlighted boxes indicate defect regions on representative objects. D Reasoning Examples [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Reasoning example: tap with bulge. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Reasoning example: vase with concavity [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Reasoning example: helmet with hole. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Reasoning example: microphone with scratch. These examples show that VISTA does not only output final answers, but also grounds them in explicit visual evidence. The model links object-level cues to category prediction, defect morphology to anomaly type, and viewpoint-dependent visibility to temporal localization. This structured reasoning format makes the four task predictions interpretable and exposes w… view at source ↗

**Figure 14.** Figure 14: PS-SFT training curves. We report loss, token accuracy, learning rate, and gradient norm during perception-structured supervised fine-tuning [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗

**Figure 15.** Figure 15: shows the VISTA-GRPO training curves, including KL divergence, gradient norm, entropy, and completion length [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗

read the original abstract

Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a practical new multi-view video dataset for industrial anomaly detection plus a two-stage fine-tuning recipe that reports clear gains on their benchmark, but the unseen split leaves the generalization claim open to doubt.

read the letter

This paper introduces MMVIAD, a dataset of 2-second multi-view video clips covering 48 object categories, 14 environments, and 6 anomaly types, along with a benchmark for four tasks: anomaly detection, defect classification, object classification, and temporal localization. It also describes VISTA, a two-stage post-training approach that first does perception-structured supervised fine-tuning and then applies visibility-grounded GRPO with semantic and temporal rewards. On their MMVIAD-Unseen split the method raises average scores from 45.0 to 57.5 and beats GPT-5.4, while showing that existing video MLLMs still lag human performance on fine-grained defect work. Code release helps anyone who wants to reproduce or extend it.

Referee Report

3 major / 2 minor

Summary. The paper introduces MMVIAD, a new multi-view multi-task video dataset for industrial anomaly detection consisting of 2-second 120° object-centric clips across 48 categories, 14 environments, and 6 anomaly types. It supports four tasks (anomaly detection, defect classification, object classification, anomaly visible-time localization) and provides a benchmark showing current video MLLMs lag human performance. The authors propose a two-stage post-training pipeline—PS-SFT for perception-structured reasoning followed by VISTA-GRPO with semantic-gated defect and visibility-aware temporal rewards—yielding the VISTA model. On the MMVIAD-Unseen split, VISTA raises the base model's average multi-task score from 45.0 to 57.5 and surpasses GPT-5.4; source code is released.

Significance. If the reported gains reflect genuine generalization rather than dataset-specific overfitting, MMVIAD would be a valuable addition to industrial anomaly detection resources by shifting focus from static images to continuous multi-view video inspection. The two-stage pipeline and reward design offer a concrete approach to improving structured reasoning in video MLLMs for this domain. The public code release supports reproducibility.

major comments (3)

[§4, §5.2] §4 (Dataset Construction) and §5.2 (Evaluation Protocol): The construction of the MMVIAD-Unseen split is not specified—e.g., whether entire object categories, environments, or anomaly types are held out versus a random video-level split. Without this, the claim that the 57.5 average score demonstrates transferable anomaly understanding (rather than exploitation of MMVIAD's particular statistics) cannot be evaluated.
[§5.3] §5.3 (VISTA-GRPO): The semantic-gated defect reward and visibility-aware temporal reward are derived directly from the same human annotations used to create MMVIAD. The paper must demonstrate that these rewards do not introduce leakage or allow the policy to fit annotation artifacts specific to the 48-category, 14-environment collection rather than learning robust perception.
[§5.1, Table 2] §5.1 and Table 2: No information is provided on baseline implementation details, statistical significance testing, or controls for data leakage in the reported numbers (base model 45.0 → VISTA 57.5). This makes it impossible to verify that the improvement over GPT-5.4 is robust.

minor comments (2)

[Abstract, §1] The abstract and §1 cite 'human performance' as an upper bound but provide no details on how human baselines were collected or measured.
[§3, §5] Notation for the four tasks and the exact averaging procedure for the 'average score' should be defined explicitly in §3 or §5.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback, which has helped us identify areas where additional clarity and rigor are needed. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of the dataset splits, reward design, and evaluation protocols.

read point-by-point responses

Referee: [§4, §5.2] §4 (Dataset Construction) and §5.2 (Evaluation Protocol): The construction of the MMVIAD-Unseen split is not specified—e.g., whether entire object categories, environments, or anomaly types are held out versus a random video-level split. Without this, the claim that the 57.5 average score demonstrates transferable anomaly understanding (rather than exploitation of MMVIAD's particular statistics) cannot be evaluated.

Authors: We agree that the original description of the MMVIAD-Unseen split in §4 was insufficiently detailed. The split holds out 12 complete object categories (25% of the 48 total) that do not appear in the training or validation sets, while environments and anomaly types have partial overlap but with no video-level leakage from the held-out categories. We have expanded §4 with a new table (Table 1) listing the exact held-out categories, the distribution of environments across splits, and a diagram illustrating the category-level hold-out strategy. This makes explicit that the 57.5 score reflects generalization to unseen objects rather than memorization of dataset-specific statistics. revision: yes
Referee: [§5.3] §5.3 (VISTA-GRPO): The semantic-gated defect reward and visibility-aware temporal reward are derived directly from the same human annotations used to create MMVIAD. The paper must demonstrate that these rewards do not introduce leakage or allow the policy to fit annotation artifacts specific to the 48-category, 14-environment collection rather than learning robust perception.

Authors: The rewards are computed exclusively from the training-split annotations and never access test or Unseen annotations during GRPO. To address potential fitting to collection-specific artifacts, we have added an ablation study in the revised §5.3 showing that ablating the semantic gate drops Unseen performance by 4.2 points while the visibility-aware temporal reward contributes an additional 3.1 points; both components improve generalization metrics on held-out categories. We have also clarified that the reward functions use only coarse category-level and visibility labels rather than fine-grained per-video artifacts, and we include a new paragraph discussing why this design encourages robust perception rather than overfitting. revision: partial
Referee: [§5.1, Table 2] §5.1 and Table 2: No information is provided on baseline implementation details, statistical significance testing, or controls for data leakage in the reported numbers (base model 45.0 → VISTA 57.5). This makes it impossible to verify that the improvement over GPT-5.4 is robust.

Authors: We acknowledge the omission of these details. In the revised manuscript we have expanded §5.1 with full prompting templates and decoding parameters used for all baselines (including GPT-5.4), added standard deviations computed over three independent runs for every reported score, and included a paired t-test (p < 0.01) confirming the 12.5-point average improvement is statistically significant. We have also added an explicit statement and appendix section verifying that no MMVIAD-Unseen videos appear in any baseline pre-training corpora or fine-tuning data, thereby ruling out leakage. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results on newly introduced dataset

full rationale

The paper introduces the MMVIAD dataset and reports empirical performance numbers for a two-stage training pipeline (PS-SFT followed by VISTA-GRPO) on the MMVIAD-Unseen split. No equations, self-definitional quantities, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The central result (score lift from 45.0 to 57.5) is a measured outcome on held-out data rather than a quantity that reduces to its own inputs by construction. The derivation chain is therefore self-contained as standard supervised fine-tuning plus RL on a new benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the central claim rests on the empirical utility of the new dataset and the effectiveness of the proposed training stages.

pith-pipeline@v0.9.0 · 5593 in / 1164 out tokens · 66385 ms · 2026-05-12T05:03:31.769433+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 5 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. volume 35, pages 23716–23736, 2022

work page 2022
[2]

Localizing moments in video with natural language

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. InProceedings of the IEEE international conference on computer vision, pages 5803–5812, 2017

work page 2017
[3]

Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

work page 2023
[4]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Efficientad: Accurate visual anomaly detection at millisecond-level latencies

Kilian Batzner, Lars Heckler, and Rebecca König. Efficientad: Accurate visual anomaly detection at millisecond-level latencies. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 128–138, 2024

work page 2024
[6]

Mvtec ad–a comprehen- sive real-world dataset for unsupervised anomaly detection

Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehen- sive real-world dataset for unsupervised anomaly detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9592–9600, 2019

work page 2019
[7]

The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization

Paul Bergmann, Xin Jin, David Sattlegger, and Carsten Steger. The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization. 2021

work page 2021
[8]

Anomalyr1: A grpo-based end-to-end mllm for industrial anomaly detection.arXiv preprint arXiv:2504.11914, 2025

Yuhao Chao, Jie Liu, Jie Tang, and Gangshan Wu. Anomalyr1: A grpo-based end-to-end mllm for industrial anomaly detection.arXiv preprint arXiv:2504.11914, 2025. 10

work page arXiv 2025
[9]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. volume 36, pages 49250–49267, 2023

work page 2023
[11]

Padim: a patch distribution modeling framework for anomaly detection and localization

Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: a patch distribution modeling framework for anomaly detection and localization. InInternational conference on pattern recognition, pages 475–489. Springer, 2021

work page 2021
[12]

Anomaly detection via reverse distillation from one-class embed- ding

Hanqiu Deng and Xingyu Li. Anomaly detection via reverse distillation from one-class embed- ding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9737–9746, 2022

work page 2022
[13]

Video-r1: Reinforcing video reasoning in mllms

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. 2025

work page 2025
[14]

Tall: Temporal activity localization via language query

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017

work page 2017
[15]

Anoma- lygpt: Detecting industrial anomalies using large vision-language models

Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anoma- lygpt: Detecting industrial anomalies using large vision-language models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 1932–1940, 2024

work page 1932
[16]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding

Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, and Kevin Zhao. Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3302–3310, 2025

work page 2025
[18]

Trace: Temporal grounding video llm via causal event modeling

Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. Trace: Temporal grounding video llm via causal event modeling. 2024

work page 2024
[19]

Deep learning- based defect detection of metal parts: evaluating current methods in complex conditions

Stepan Jezek, Martin Jonak, Radim Burget, Pavel Dvorak, and Milos Skotak. Deep learning- based defect detection of metal parts: evaluating current methods in complex conditions. In 2021 13th International congress on ultra modern telecommunications and control systems and workshops (ICUMT), pages 66–71. IEEE, 2021

work page 2021
[20]

Mmad: A comprehensive benchmark for multimodal large language models in industrial anomaly detection

Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. Mmad: A comprehensive benchmark for multimodal large language models in industrial anomaly detection. 2024

work page 2024
[21]

Detecting moments and highlights in videos via natural language queries

Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. volume 34, pages 11846–11858, 2021

work page 2021
[22]

Cutpaste: Self-supervised learning for anomaly detection and localization

Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. Cutpaste: Self-supervised learning for anomaly detection and localization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9664–9674, 2021

work page 2021
[23]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

work page 2023
[24]

Towards visual discrimination and reasoning of real-world physical dynamics: Physics-grounded anomaly detection

Wenqiao Li, Yao Gu, Xintao Chen, Xiaohao Xu, Ming Hu, Xiaonan Huang, and Yingna Wu. Towards visual discrimination and reasoning of real-world physical dynamics: Physics-grounded anomaly detection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30409–30419, 2025. 11

work page 2025
[25]

Towards scalable 3d anomaly detection and localization: A benchmark via 3d anomaly synthesis and a self-supervised learning network

Wenqiao Li, Xiaohao Xu, Yao Gu, Bozhong Zheng, Shenghua Gao, and Yingna Wu. Towards scalable 3d anomaly detection and localization: A benchmark via 3d anomaly synthesis and a self-supervised learning network. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22207–22216, 2024

work page 2024
[26]

arXiv preprint arXiv:2504.06958 (2025)

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforce- ment fine-tuning.arXiv preprint arXiv:2504.06958, 2025

work page arXiv 2025
[27]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

work page 2024
[28]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. volume 36, pages 34892–34916, 2023

work page 2023
[29]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

work page 2024
[30]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. volume 35, pages 27730–27744, 2022

work page 2022
[31]

Chatvtg: Video temporal grounding via chat with video dialogue large language models

Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. Chatvtg: Video temporal grounding via chat with video dialogue large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024

work page 2024
[32]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. volume 36, pages 53728–53741, 2023

work page 2023
[33]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024

work page 2024
[34]

Towards total recall in industrial anomaly detection

Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14318–14328, 2022

work page 2022
[35]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection

Chengjie Wang, Wenbing Zhu, Bin-Bin Gao, Zhenye Gan, Jiangning Zhang, Zhihao Gu, Shuguang Qian, Mingang Chen, and Lizhuang Ma. Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22883–22892, 2024

work page 2024
[37]

Time-r1: Post-training large vision language model for temporal video grounding

Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding. 2025

work page 2025
[38]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. volume 35, pages 24824–24837, 2022

work page 2022
[39]

Videochat-r1

Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1. 5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. 2025

work page 2025
[40]

Draem-a discriminatively trained re- construction embedding for surface anomaly detection

Vitjan Zavrtanik, Matej Kristan, and Danijel Sko ˇcaj. Draem-a discriminatively trained re- construction embedding for surface anomaly detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 8330–8339, 2021. 12

work page 2021
[41]

Video-llama: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. InProceedings of the 2023 conference on empirical methods in natural language processing: system demonstrations, pages 543–553, 2023

work page 2023
[42]

Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024

work page internal anchor Pith review arXiv 2024
[43]

Learning 2d temporal adjacent networks for moment localization with natural language

Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2d temporal adjacent networks for moment localization with natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 12870–12877, 2020

work page 2020
[44]

Rad: A dataset and benchmark for real-life anomaly detection with robotic observations.arXiv preprint arXiv:2410.00713, 2024

Kaichen Zhou, Xinhai Chang, Taewhan Kim, Jiadong Zhang, Yang Cao, Chufei Peng, Fangneng Zhan, Hao Zhao, Hao Dong, Kai Ming Ting, et al. Rad: A dataset and benchmark for real-life anomaly detection with robotic observations.arXiv preprint arXiv:2410.00713, 2024

work page arXiv 2024
[45]

Pad: A dataset and benchmark for pose-agnostic anomaly detection.Advances in Neural Information Processing Systems, 36:44558–44571, 2023

Qiang Zhou, Weize Li, Lihan Jiang, Guoliang Wang, Guyue Zhou, Shanghang Zhang, and Hao Zhao. Pad: A dataset and benchmark for pose-agnostic anomaly detection.Advances in Neural Information Processing Systems, 36:44558–44571, 2023

work page 2023
[46]

Minigpt-4: Enhanc- ing vision-language understanding with advanced large language models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhanc- ing vision-language understanding with advanced large language models. 2023

work page 2023
[47]

w/oRsg” removes the semantic gate for defect classification, and “w/o Rvis

Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the- difference self-supervised pre-training for anomaly detection and segmentation. InEuropean conference on computer vision, pages 392–408. Springer, 2022. 13 A Broader Impact A.1 Potential Positive Impacts Industrial anomaly detection is important for manufacturing quality c...

work page 2022