Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Armin Danesh Pazho; Hamed Tabkhi; Narges Rashvand; Shanle Yao

arxiv: 2603.04727 · v2 · pith:X6T6AVYOnew · submitted 2026-03-05 · 💻 cs.CV · cs.AI

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Shanle Yao , Armin Danesh Pazho , Narges Rashvand , Hamed Tabkhi This is my paper

Pith reviewed 2026-05-21 12:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multimodal LLMsvideo anomaly detectionzero-shot learningsurveillanceprecision-recall tradeoffprompt engineering

0 comments

The pith

Multimodal LLMs exhibit conservative bias in zero-shot video anomaly detection, favoring normal labels and collapsing recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates state-of-the-art multimodal LLMs on surveillance video anomaly detection by recasting the task as prompt-based binary classification under weak temporal supervision. It reveals that zero-shot inference produces high confidence but strongly favors the normal class, yielding high precision while recall drops sharply and limits usefulness. Class-specific instructions can move the decision boundary and lift peak F1 on ShanghaiTech from 0.09 to 0.64, yet recall stays the main constraint. A reader would care because real surveillance requires catching rare events reliably, and this work maps where current general-purpose models fall short in noisy conditions.

Core claim

Multimodal large language models can treat video anomaly detection as a language-guided binary classification task, but in zero-shot settings they display a pronounced conservative bias that favors the normal class, producing high precision at the cost of recall collapse. Class-specific instructions shift this boundary and raise the peak F1-score on ShanghaiTech from 0.09 to 0.64, although recall remains a critical bottleneck. The results point to a sizable performance gap for MLLMs in the noisy, open-world environments of actual surveillance.

What carries the argument

Reformulating video anomaly detection as a prompt-driven binary classification task inside multimodal LLMs, using varying prompt specificity and 1-3 second temporal windows to expose and adjust the normal-versus-anomalous decision boundary.

If this is right

Class-specific instructions can shift the decision boundary and raise peak F1-scores substantially on standard benchmarks.
Recall remains the dominant practical limitation even after prompt adjustments.
MLLMs exhibit a clear performance gap when faced with the noise typical of open-world surveillance.
Recall-oriented prompting and model calibration are required to close the gap for real deployments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bias may appear in other rare-event detection settings such as medical video or industrial monitoring where missing positives is costly.
Hybrid pipelines that combine MLLM reasoning with traditional reconstruction cues could address the recall shortfall without full retraining.
Investigating the training data distribution that produces default normal predictions could guide more balanced pre-training for safety-critical uses.

Load-bearing premise

The assumption that results on the ShanghaiTech and CHAD benchmarks under weak temporal supervision and chosen prompt styles generalize to the noisy, open-world conditions of actual surveillance deployments.

What would settle it

Running the same models on additional uncurated real-time surveillance streams containing varied noise levels and anomalies, then checking whether recall stays near zero or rises measurably.

Figures

Figures reproduced from arXiv: 2603.04727 by Armin Danesh Pazho, Hamed Tabkhi, Narges Rashvand, Shanle Yao.

**Figure 1.** Figure 1: Conceputal Overview increasingly rely on automated monitoring, detection systems must not only “see motion,” but also reason about intent, context, and risk. In principle, MLLMs are well-positioned for this: their semantic priors and language interface could enable more interpretable detection decisions, richer explanations, and scalable annotation, capabilities that classical reconstruction, or pose-based… view at source ↗

**Figure 2.** Figure 2: System architecture for prompt-based video anomaly detection. The workflow illustrates the transformation of raw input [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

read the original abstract

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows current MLLMs lean heavily toward calling everything normal in zero-shot VAD on ShanghaiTech and CHAD, with class-specific prompts lifting F1 from 0.09 to 0.64 but leaving recall as the main blocker.

read the letter

The main takeaway is that off-the-shelf MLLMs are not yet reliable for zero-shot anomaly detection in video surveillance. They show a clear conservative bias, defaulting to the normal class with high precision but very low recall, which makes them impractical for catching rare events without heavy prompt engineering. The work tests this on ShanghaiTech and CHAD under weak temporal supervision with 1-3 second windows and tracks how prompt wording shifts the decision boundary. That empirical pattern is the useful part here. It gives concrete numbers on how much class-specific instructions can move the F1 score and flags recall as the persistent gap. The paper does a straightforward job of reformulating VAD as binary classification and measuring prompt effects, which adds a practical data point to existing MLLM video benchmarks. The evaluation is systematic enough on the chosen datasets to document the bias and the prompt sensitivity. The soft spots are around generalization. ShanghaiTech and CHAD use controlled setups with weak supervision, so the observed bias and the F1 lift may not carry over to real deployments with variable lighting, camera motion, long untrimmed footage, or truly open-world conditions. The abstract does not mention error bars, multiple runs, or full prompt templates, which makes it harder to judge how stable the 0.09-to-0.64 shift really is. No results on more diverse or continuously monitored data are reported, so the claim about limitations for noisy surveillance rests on these two benchmarks. This paper is for people working on MLLM applications in video or surveillance who want a quick reality check on current zero-shot performance. It is not a methods paper with new architectures, but the measurements are honest and point to concrete next steps like recall-focused prompting or calibration. It deserves a serious referee because the empirical gap is documented clearly enough to guide follow-up work, even if the generalization to actual deployments needs more testing.

Referee Report

2 major / 2 minor

Summary. The paper systematically evaluates state-of-the-art multimodal LLMs on zero-shot anomaly detection in surveillance videos by reformulating VAD as a binary classification task under weak temporal supervision on the ShanghaiTech and CHAD benchmarks. It investigates the influence of prompt specificity and temporal window lengths on the precision-recall trade-off, revealing a conservative bias in zero-shot settings with high precision but recall collapse. Class-specific instructions are shown to improve peak F1-score on ShanghaiTech from 0.09 to 0.64, though recall remains a bottleneck, highlighting performance gaps for practical open-world surveillance.

Significance. This work is significant as it provides an empirical reality check on the readiness of MLLMs for real-world surveillance applications, a domain where traditional VAD methods have been dominant. By demonstrating the conservative bias and the potential of prompt engineering to shift decision boundaries, it opens avenues for recall-oriented prompting and model calibration. If validated with more robust experimental protocols, these findings could influence the design of future MLLM-based systems for video understanding in noisy environments.

major comments (2)

[Section 4 (Experiments)] The peak F1-score of 0.64 achieved with class-specific instructions on ShanghaiTech is reported without accompanying error bars, number of independent trials, or statistical significance measures. This undermines confidence in the claimed improvement from 0.09, as it is unclear if the result is robust to variations in prompt phrasing or random seeds.
[Section 5 (Discussion)] The assertion that recall is a critical bottleneck limiting practical utility in surveillance relies on the representativeness of ShanghaiTech and CHAD under weak temporal supervision and 1-3s windows. No additional results on more challenging conditions such as variable lighting, camera motion, or long untrimmed videos are provided to support generalization to open-world deployments.

minor comments (2)

[Abstract] The abstract refers to 'full prompt templates' but these are not included in the main text or supplementary material, which would aid reproducibility.
[Related Work] Consider adding more recent references on MLLM applications in video anomaly detection to better contextualize the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us strengthen the experimental reporting and better delineate the scope of our claims. We address each major comment below and indicate the corresponding revisions to the manuscript.

read point-by-point responses

Referee: [Section 4 (Experiments)] The peak F1-score of 0.64 achieved with class-specific instructions on ShanghaiTech is reported without accompanying error bars, number of independent trials, or statistical significance measures. This undermines confidence in the claimed improvement from 0.09, as it is unclear if the result is robust to variations in prompt phrasing or random seeds.

Authors: We agree that the absence of variability measures and statistical tests limits the strength of the reported improvement. In the revised manuscript we have rerun the class-specific prompting experiments on ShanghaiTech across five independent trials that incorporate both prompt paraphrasing and any model stochasticity. We now report mean peak F1 scores together with standard deviations and include a paired t-test confirming that the gain from 0.09 to 0.64 is statistically significant (p < 0.05). These updates appear in the revised Section 4 and the associated tables. revision: yes
Referee: [Section 5 (Discussion)] The assertion that recall is a critical bottleneck limiting practical utility in surveillance relies on the representativeness of ShanghaiTech and CHAD under weak temporal supervision and 1-3s windows. No additional results on more challenging conditions such as variable lighting, camera motion, or long untrimmed videos are provided to support generalization to open-world deployments.

Authors: We acknowledge that direct evidence on additional stressors would further support generalization claims. ShanghaiTech and CHAD remain the standard benchmarks for this task and already contain substantial scene diversity; our controlled setting with weak supervision and short windows was chosen to isolate the effect of prompt specificity. We have expanded the Discussion to explicitly state that the conservative bias and recall collapse we observe are likely to be at least as severe under variable lighting, camera motion, or long untrimmed sequences, and we list these conditions as important directions for future work. No new experimental results on those conditions are added in the current revision. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with direct model evaluations on public datasets

full rationale

The paper performs an empirical evaluation of MLLMs on ShanghaiTech and CHAD by reformulating VAD as binary classification under weak temporal supervision, testing prompt variants and window lengths, and reporting observed precision-recall trade-offs and F1 improvements. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations are used to justify any central result; all reported numbers are direct outputs from running the models on the chosen benchmarks. The study is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard computer-vision benchmarks and the modeling choice to treat VAD as prompted binary classification; no free parameters, invented entities, or non-standard axioms are introduced.

axioms (1)

domain assumption Weak temporal supervision and short fixed windows suffice to evaluate MLLM reasoning for anomaly detection.
The paper reformulates VAD as binary classification under weak temporal supervision to test the models.

pith-pipeline@v0.9.0 · 5776 in / 1319 out tokens · 52626 ms · 2026-05-21T12:18:17.296039+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection
cs.CV 2026-04 conditional novelty 8.0

State-of-the-art pose-based video anomaly detection models achieve over 52% frame-level AUC-ROC but drop below 10% event-level precision and 0.11 average F1 when evaluated with temporal action localization metrics on ...

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1]

Lvbench: An extreme long video understanding benchmark,

W. Wang, Z. He, W. Hong, Y . Cheng, X. Zhang, J. Qi, M. Ding, X. Gu, S. Huang, B. Xu,et al., “Lvbench: An extreme long video understanding benchmark,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22958–22967, 2025

work page 2025
[2]

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding,

X. Fang, K. Mao, H. Duan, X. Zhao, Y . Li, D. Lin, and K. Chen, “Mmbench-video: A long-form multi-shot benchmark for holistic video understanding,”Advances in Neural Information Processing Systems, vol. 37, pp. 89098–89124, 2024

work page 2024
[3]

A survey on video anomaly detection via deep learning: Human, vehicle, and environment,

G. A. Noghre, A. D. Pazho, and H. Tabkhi, “A survey on video anomaly detection via deep learning: Human, vehicle, and environment,”arXiv preprint arXiv:2508.14203, 2025

work page arXiv 2025
[4]

Ancilia: Scalable intelligent video surveillance for the artificial intelligence of things,

A. D. Pazho, C. Neff, G. A. Noghre, B. R. Ardabili, S. Yao, M. Baharani, and H. Tabkhi, “Ancilia: Scalable intelligent video surveillance for the artificial intelligence of things,”IEEE Internet of Things Journal, vol. 10, no. 17, pp. 14940–14951, 2023

work page 2023
[5]

From lab to field: Real-world evaluation of an ai-driven smart video solution to enhance community safety,

S. Yao, B. R. Ardabili, A. D. Pazho, G. A. Noghre, C. Neff, L. Bourque, and H. Tabkhi, “From lab to field: Real-world evaluation of an ai-driven smart video solution to enhance community safety,”Internet of Things, p. 101716, 2025

work page 2025
[6]

Evaluating the effectiveness of video anomaly detection in the wild: Online learning and inference for real-world deployment,

S. Yao, G. A. Noghre, A. D. Pazho, and H. Tabkhi, “Evaluating the effectiveness of video anomaly detection in the wild: Online learning and inference for real-world deployment,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4832– 4841, 2024

work page 2024
[7]

Alfred: An active learning framework for real-world semi-supervised anomaly detection with adaptive thresholds,

S. Yao, G. A. Noghre, A. D. Pazho, and H. Tabkhi, “Alfred: An active learning framework for real-world semi-supervised anomaly detection with adaptive thresholds,”arXiv preprint arXiv:2508.09058, 2025

work page arXiv 2025
[8]

Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,

J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y . Duan, and A. C. Kot, “Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,”IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684–2701, 2019

work page 2019
[9]

From offline to periodic adaptation for pose-based shoplifting detection in real-world retail security,

S. Yao, N. Rashvand, A. D. Pazho, and H. Tabkhi, “From offline to periodic adaptation for pose-based shoplifting detection in real-world retail security,”IEEE Internet of Things Journal, 2026

work page 2026
[10]

Towards adaptive human-centric video anomaly detection: A comprehensive framework and a new benchmark,

A. D. Pazho, S. Yao, G. A. Noghre, B. R. Ardabili, V . Katariya, and H. Tabkhi, “Towards adaptive human-centric video anomaly detection: A comprehensive framework and a new benchmark,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025
[11]

Future frame prediction for anomaly detection–a new baseline,

W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection–a new baseline,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 6536–6545, 2018

work page 2018
[12]

Chad: Charlotte anomaly dataset,

A. Danesh Pazho, G. Alinezhad Noghre, B. Rahimi Ardabili, C. Neff, and H. Tabkhi, “Chad: Charlotte anomaly dataset,” inScandinavian Conference on Image Analysis, pp. 50–66, Springer, 2023

work page 2023
[13]

Human-centric video anomaly detection through spatio-temporal pose tokenization and trans- former,

G. A. Noghre, A. D. Pazho, and H. Tabkhi, “Human-centric video anomaly detection through spatio-temporal pose tokenization and trans- former,”arXiv preprint arXiv:2408.15185, 2024

work page arXiv 2024
[14]

Shopformer: Transformer-based framework for detecting shoplifting via human pose,

N. Rashvand, G. A. Noghre, A. D. Pazho, B. R. Ardabili, and H. Tabkhi, “Shopformer: Transformer-based framework for detecting shoplifting via human pose,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5761–5770, 2025

work page 2025
[15]

Soft computing techniques-based digital video forensics for fraud medical anomaly detection,

S. K. Nanda, D. Ghai, P. Ingole, and S. Pande, “Soft computing techniques-based digital video forensics for fraud medical anomaly detection,”Computer Assisted Methods in Engineering and Science, vol. 30, no. 2, pp. 111–130, 2023

work page 2023
[16]

Unsupervised anomaly detection based method of risk evaluation for road traffic accident: Unsupervised anomaly detection based method of risk evaluation for road traffic accident,

C. Zhao, X. Chang, T. Xie, H. Fujita, and J. Wu, “Unsupervised anomaly detection based method of risk evaluation for road traffic accident: Unsupervised anomaly detection based method of risk evaluation for road traffic accident,”Applied Intelligence, vol. 53, no. 1, pp. 369–384, 2023

work page 2023
[17]

Enhancing traffic flow in heterogeneous freeways: Integration of mul- tivariable extremum seeking and filtered feedback linearization control,

P. K. Shahri, A. Rahmanidehkordi, A. Ghaffari, and A. H. Ghasemi, “Enhancing traffic flow in heterogeneous freeways: Integration of mul- tivariable extremum seeking and filtered feedback linearization control,” IEEE Access, vol. 13, pp. 129573–129587, 2025

work page 2025
[18]

A deep encoder-decoder network for anomaly detection in driving trajectory behavior under spatio-temporal context,

W. Yu and Q. Huang, “A deep encoder-decoder network for anomaly detection in driving trajectory behavior under spatio-temporal context,” International Journal of Applied Earth Observation and Geoinforma- tion, vol. 115, p. 103115, 2022

work page 2022
[19]

Traffic density control for heterogeneous highway systems with input constraints,

A. Rahmanidehkordi and A. H. Ghasemi, “Traffic density control for heterogeneous highway systems with input constraints,”IEEE Control Systems Letters, vol. 8, pp. 2787–2792, 2024

work page 2024
[20]

An exploratory study on human-centric video anomaly detection through variational autoencoders and trajectory prediction,

G. A. Noghre, A. D. Pazho, and H. Tabkhi, “An exploratory study on human-centric video anomaly detection through variational autoencoders and trajectory prediction,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pp. 995–1004, January 2024

work page 2024
[21]

Exploring pose-based anomaly detection for retail security: A real- world shoplifting dataset and benchmark,

N. Rashvand, G. A. Noghre, A. D. Pazho, S. Yao, and H. Tabkhi, “Exploring pose-based anomaly detection for retail security: A real- world shoplifting dataset and benchmark,” inProceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops, pp. 1123–1131, February 2025

work page 2025
[22]

Gaussian process regression-based video anomaly detection and localization with hierar- chical feature representation,

K.-W. Cheng, Y .-T. Chen, and W.-H. Fang, “Gaussian process regression-based video anomaly detection and localization with hierar- chical feature representation,”IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5288–5301, 2015

work page 2015
[23]

Toward abnormal trajectory and event detection in video surveillance,

S. Cos ¸ar, G. Donatiello, V . Bogorny, C. Garate, L. O. Alvares, and F. Br´emond, “Toward abnormal trajectory and event detection in video surveillance,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pp. 683–695, 2016

work page 2016
[24]

Weakly-supervised video anomaly detection with robust temporal fea- ture magnitude learning,

Y . Tian, G. Pang, Y . Chen, R. Singh, J. W. Verjans, and G. Carneiro, “Weakly-supervised video anomaly detection with robust temporal fea- ture magnitude learning,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 4975–4986, 2021

work page 2021
[25]

Essl: Enhanced spatio-temporal self-selective learning framework for unsupervised video anomaly de- tection.,

Q. Li, X. Pan, F. Xiao, and B. Bhanu, “Essl: Enhanced spatio-temporal self-selective learning framework for unsupervised video anomaly de- tection.,” inECAI, pp. 1398–1405, 2023

work page 2023
[26]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

work page 2017
[27]

Detection of shoplifting on video using a hybrid network,

L. Kirichenko, T. Radivilova, B. Sydorenko, and S. Yakovlev, “Detection of shoplifting on video using a hybrid network,”Computation, vol. 10, no. 11, p. 199, 2022

work page 2022
[28]

Criminal intention de- tection at early stages of shoplifting cases by using 3d convolutional neural networks,

G. A. Mart ´ınez-Mascorro, J. R. Abreu-Pederzini, J. C. Ortiz-Bayliss, A. Garcia-Collantes, and H. Terashima-Mar ´ın, “Criminal intention de- tection at early stages of shoplifting cases by using 3d convolutional neural networks,”Computation, vol. 9, no. 2, p. 24, 2021

work page 2021
[29]

Normalizing flows for human pose anomaly detection,

O. Hirschorn and S. Avidan, “Normalizing flows for human pose anomaly detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13545–13554, 2023

work page 2023
[30]

Regularity learning via explicit distribution modeling for skeletal video anomaly detection,

S. Yu, Z. Zhao, H. Fang, A. Deng, H. Su, D. Wang, W. Gan, C. Lu, and W. Wu, “Regularity learning via explicit distribution modeling for skeletal video anomaly detection,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 8, pp. 6661–6673, 2023

work page 2023
[31]

Graph embedded pose clustering for anomaly detection,

A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, “Graph embedded pose clustering for anomaly detection,” inProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10539–10547, 2020

work page 2020
[32]

Large Language Models: A Survey

S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Am- atriain, and J. Gao, “Large language models: A survey,”arXiv preprint arXiv:2402.06196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

A survey of gpt-3 family large language models including chatgpt and gpt-4,

K. S. Kalyan, “A survey of gpt-3 family large language models including chatgpt and gpt-4,”Natural Language Processing Journal, vol. 6, p. 100048, 2024

work page 2024
[34]

Quantifying influencer impact on affective polarization,

R. Rashid, J. Melton, O. Ghorbani, S. Krishnan, S. Reid, and G. Tere- janu, “Quantifying influencer impact on affective polarization,” in 2024 International Conference on Machine Learning and Applications (ICMLA), pp. 1135–1140, 2024

work page 2024
[35]

A survey on large language models from concept to implementation,

C. Wang, J. Zhao, and J. Gong, “A survey on large language models from concept to implementation,”arXiv preprint arXiv:2403.18969, 2024

work page arXiv 2024
[36]

Examining radiation therapy planning knowledge in large language models,

O. Ghorbani, A. Helmy, Q. J. Wu, and Y . Ge, “Examining radiation therapy planning knowledge in large language models,” inProceedings of the 16th ACM International Conference on Bioinformatics, Compu- tational Biology, and Health Informatics, pp. 1–1, 2025

work page 2025
[37]

Mm- llms: Recent advances in multimodal large language models,

D. Zhang, Y . Yu, J. Dong, C. Li, D. Su, C. Chu, and D. Yu, “Mm- llms: Recent advances in multimodal large language models,”Findings of the Association for Computational Linguistics: ACL 2024, pp. 12401– 12430, 2024

work page 2024
[38]

A survey on multimodal large language models,

S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,”National Science Review, vol. 11, no. 12, p. nwae403, 2024

work page 2024
[39]

A systematic review of multi-modal large language models on domain-specific applications,

S. Li, K. W. Wong, G. Wang, and T.-T. Duong, “A systematic review of multi-modal large language models on domain-specific applications,” Artificial Intelligence Review, vol. 58, no. 12, pp. 1–47, 2025

work page 2025
[40]

Pre-trained video generative models as world simulators,

H. He, Y . Zhang, L. Lin, Z. Xu, and L. Pan, “Pre-trained video generative models as world simulators,”arXiv preprint arXiv:2502.07825, 2025

work page arXiv 2025
[41]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V . Birodkar, J. Yan, M.-C. Chiu,et al., “Videopoet: A large language model for zero-shot video generation,”arXiv preprint arXiv:2312.14125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Make-A-Video: Text-to-Video Generation without Text-Video Data

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni,et al., “Make-a-video: Text-to-video generation without text-video data,”arXiv preprint arXiv:2209.14792, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[43]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang,et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford,et al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Video-bench: Human-aligned video generation benchmark,

H. Han, S. Li, J. Chen, Y . Yuan, Y . Wu, Y . Deng, C. T. Leong, H. Du, J. Fu, Y . Li,et al., “Video-bench: Human-aligned video generation benchmark,” inProceedings of the Computer Vision and Pattern Recog- nition Conference, pp. 18858–18868, 2025

work page 2025
[46]

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

D. Zheng, Z. Huang, H. Liu, K. Zou, Y . He, F. Zhang, L. Gu, Y . Zhang, J. He, W.-S. Zheng,et al., “Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness,”arXiv preprint arXiv:2503.21755, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,

C. Fu, Y . Dai, Y . Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y . Shen, M. Zhang,et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 24108–24118, 2025

work page 2025
[48]

Tempcompass: Do video llms really understand videos?,

Y . Liu, S. Li, Y . Liu, Y . Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou, “Tempcompass: Do video llms really understand videos?,” in Findings of the Association for Computational Linguistics: ACL 2024, pp. 8731–8772, 2024

work page 2024
[49]

Ucvl: a benchmark for crime surveillance video analysis with large models,

H. Chen, D. Yi, M. Cao, C. Huang, G. Zhu, and J. Wang, “Ucvl: a benchmark for crime surveillance video analysis with large models,” in2025 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA), pp. 2051–2057, IEEE, 2025

work page 2051

[1] [1]

Lvbench: An extreme long video understanding benchmark,

W. Wang, Z. He, W. Hong, Y . Cheng, X. Zhang, J. Qi, M. Ding, X. Gu, S. Huang, B. Xu,et al., “Lvbench: An extreme long video understanding benchmark,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22958–22967, 2025

work page 2025

[2] [2]

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding,

X. Fang, K. Mao, H. Duan, X. Zhao, Y . Li, D. Lin, and K. Chen, “Mmbench-video: A long-form multi-shot benchmark for holistic video understanding,”Advances in Neural Information Processing Systems, vol. 37, pp. 89098–89124, 2024

work page 2024

[3] [3]

A survey on video anomaly detection via deep learning: Human, vehicle, and environment,

G. A. Noghre, A. D. Pazho, and H. Tabkhi, “A survey on video anomaly detection via deep learning: Human, vehicle, and environment,”arXiv preprint arXiv:2508.14203, 2025

work page arXiv 2025

[4] [4]

Ancilia: Scalable intelligent video surveillance for the artificial intelligence of things,

A. D. Pazho, C. Neff, G. A. Noghre, B. R. Ardabili, S. Yao, M. Baharani, and H. Tabkhi, “Ancilia: Scalable intelligent video surveillance for the artificial intelligence of things,”IEEE Internet of Things Journal, vol. 10, no. 17, pp. 14940–14951, 2023

work page 2023

[5] [5]

From lab to field: Real-world evaluation of an ai-driven smart video solution to enhance community safety,

S. Yao, B. R. Ardabili, A. D. Pazho, G. A. Noghre, C. Neff, L. Bourque, and H. Tabkhi, “From lab to field: Real-world evaluation of an ai-driven smart video solution to enhance community safety,”Internet of Things, p. 101716, 2025

work page 2025

[6] [6]

Evaluating the effectiveness of video anomaly detection in the wild: Online learning and inference for real-world deployment,

S. Yao, G. A. Noghre, A. D. Pazho, and H. Tabkhi, “Evaluating the effectiveness of video anomaly detection in the wild: Online learning and inference for real-world deployment,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4832– 4841, 2024

work page 2024

[7] [7]

Alfred: An active learning framework for real-world semi-supervised anomaly detection with adaptive thresholds,

S. Yao, G. A. Noghre, A. D. Pazho, and H. Tabkhi, “Alfred: An active learning framework for real-world semi-supervised anomaly detection with adaptive thresholds,”arXiv preprint arXiv:2508.09058, 2025

work page arXiv 2025

[8] [8]

Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,

J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y . Duan, and A. C. Kot, “Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,”IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684–2701, 2019

work page 2019

[9] [9]

From offline to periodic adaptation for pose-based shoplifting detection in real-world retail security,

S. Yao, N. Rashvand, A. D. Pazho, and H. Tabkhi, “From offline to periodic adaptation for pose-based shoplifting detection in real-world retail security,”IEEE Internet of Things Journal, 2026

work page 2026

[10] [10]

Towards adaptive human-centric video anomaly detection: A comprehensive framework and a new benchmark,

A. D. Pazho, S. Yao, G. A. Noghre, B. R. Ardabili, V . Katariya, and H. Tabkhi, “Towards adaptive human-centric video anomaly detection: A comprehensive framework and a new benchmark,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025

[11] [11]

Future frame prediction for anomaly detection–a new baseline,

W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection–a new baseline,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 6536–6545, 2018

work page 2018

[12] [12]

Chad: Charlotte anomaly dataset,

A. Danesh Pazho, G. Alinezhad Noghre, B. Rahimi Ardabili, C. Neff, and H. Tabkhi, “Chad: Charlotte anomaly dataset,” inScandinavian Conference on Image Analysis, pp. 50–66, Springer, 2023

work page 2023

[13] [13]

Human-centric video anomaly detection through spatio-temporal pose tokenization and trans- former,

G. A. Noghre, A. D. Pazho, and H. Tabkhi, “Human-centric video anomaly detection through spatio-temporal pose tokenization and trans- former,”arXiv preprint arXiv:2408.15185, 2024

work page arXiv 2024

[14] [14]

Shopformer: Transformer-based framework for detecting shoplifting via human pose,

N. Rashvand, G. A. Noghre, A. D. Pazho, B. R. Ardabili, and H. Tabkhi, “Shopformer: Transformer-based framework for detecting shoplifting via human pose,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5761–5770, 2025

work page 2025

[15] [15]

Soft computing techniques-based digital video forensics for fraud medical anomaly detection,

S. K. Nanda, D. Ghai, P. Ingole, and S. Pande, “Soft computing techniques-based digital video forensics for fraud medical anomaly detection,”Computer Assisted Methods in Engineering and Science, vol. 30, no. 2, pp. 111–130, 2023

work page 2023

[16] [16]

Unsupervised anomaly detection based method of risk evaluation for road traffic accident: Unsupervised anomaly detection based method of risk evaluation for road traffic accident,

C. Zhao, X. Chang, T. Xie, H. Fujita, and J. Wu, “Unsupervised anomaly detection based method of risk evaluation for road traffic accident: Unsupervised anomaly detection based method of risk evaluation for road traffic accident,”Applied Intelligence, vol. 53, no. 1, pp. 369–384, 2023

work page 2023

[17] [17]

Enhancing traffic flow in heterogeneous freeways: Integration of mul- tivariable extremum seeking and filtered feedback linearization control,

P. K. Shahri, A. Rahmanidehkordi, A. Ghaffari, and A. H. Ghasemi, “Enhancing traffic flow in heterogeneous freeways: Integration of mul- tivariable extremum seeking and filtered feedback linearization control,” IEEE Access, vol. 13, pp. 129573–129587, 2025

work page 2025

[18] [18]

A deep encoder-decoder network for anomaly detection in driving trajectory behavior under spatio-temporal context,

W. Yu and Q. Huang, “A deep encoder-decoder network for anomaly detection in driving trajectory behavior under spatio-temporal context,” International Journal of Applied Earth Observation and Geoinforma- tion, vol. 115, p. 103115, 2022

work page 2022

[19] [19]

Traffic density control for heterogeneous highway systems with input constraints,

A. Rahmanidehkordi and A. H. Ghasemi, “Traffic density control for heterogeneous highway systems with input constraints,”IEEE Control Systems Letters, vol. 8, pp. 2787–2792, 2024

work page 2024

[20] [20]

An exploratory study on human-centric video anomaly detection through variational autoencoders and trajectory prediction,

G. A. Noghre, A. D. Pazho, and H. Tabkhi, “An exploratory study on human-centric video anomaly detection through variational autoencoders and trajectory prediction,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pp. 995–1004, January 2024

work page 2024

[21] [21]

Exploring pose-based anomaly detection for retail security: A real- world shoplifting dataset and benchmark,

N. Rashvand, G. A. Noghre, A. D. Pazho, S. Yao, and H. Tabkhi, “Exploring pose-based anomaly detection for retail security: A real- world shoplifting dataset and benchmark,” inProceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops, pp. 1123–1131, February 2025

work page 2025

[22] [22]

Gaussian process regression-based video anomaly detection and localization with hierar- chical feature representation,

K.-W. Cheng, Y .-T. Chen, and W.-H. Fang, “Gaussian process regression-based video anomaly detection and localization with hierar- chical feature representation,”IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5288–5301, 2015

work page 2015

[23] [23]

Toward abnormal trajectory and event detection in video surveillance,

S. Cos ¸ar, G. Donatiello, V . Bogorny, C. Garate, L. O. Alvares, and F. Br´emond, “Toward abnormal trajectory and event detection in video surveillance,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pp. 683–695, 2016

work page 2016

[24] [24]

Weakly-supervised video anomaly detection with robust temporal fea- ture magnitude learning,

Y . Tian, G. Pang, Y . Chen, R. Singh, J. W. Verjans, and G. Carneiro, “Weakly-supervised video anomaly detection with robust temporal fea- ture magnitude learning,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 4975–4986, 2021

work page 2021

[25] [25]

Essl: Enhanced spatio-temporal self-selective learning framework for unsupervised video anomaly de- tection.,

Q. Li, X. Pan, F. Xiao, and B. Bhanu, “Essl: Enhanced spatio-temporal self-selective learning framework for unsupervised video anomaly de- tection.,” inECAI, pp. 1398–1405, 2023

work page 2023

[26] [26]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

work page 2017

[27] [27]

Detection of shoplifting on video using a hybrid network,

L. Kirichenko, T. Radivilova, B. Sydorenko, and S. Yakovlev, “Detection of shoplifting on video using a hybrid network,”Computation, vol. 10, no. 11, p. 199, 2022

work page 2022

[28] [28]

Criminal intention de- tection at early stages of shoplifting cases by using 3d convolutional neural networks,

G. A. Mart ´ınez-Mascorro, J. R. Abreu-Pederzini, J. C. Ortiz-Bayliss, A. Garcia-Collantes, and H. Terashima-Mar ´ın, “Criminal intention de- tection at early stages of shoplifting cases by using 3d convolutional neural networks,”Computation, vol. 9, no. 2, p. 24, 2021

work page 2021

[29] [29]

Normalizing flows for human pose anomaly detection,

O. Hirschorn and S. Avidan, “Normalizing flows for human pose anomaly detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13545–13554, 2023

work page 2023

[30] [30]

Regularity learning via explicit distribution modeling for skeletal video anomaly detection,

S. Yu, Z. Zhao, H. Fang, A. Deng, H. Su, D. Wang, W. Gan, C. Lu, and W. Wu, “Regularity learning via explicit distribution modeling for skeletal video anomaly detection,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 8, pp. 6661–6673, 2023

work page 2023

[31] [31]

Graph embedded pose clustering for anomaly detection,

A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, “Graph embedded pose clustering for anomaly detection,” inProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10539–10547, 2020

work page 2020

[32] [32]

Large Language Models: A Survey

S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Am- atriain, and J. Gao, “Large language models: A survey,”arXiv preprint arXiv:2402.06196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

A survey of gpt-3 family large language models including chatgpt and gpt-4,

K. S. Kalyan, “A survey of gpt-3 family large language models including chatgpt and gpt-4,”Natural Language Processing Journal, vol. 6, p. 100048, 2024

work page 2024

[34] [34]

Quantifying influencer impact on affective polarization,

R. Rashid, J. Melton, O. Ghorbani, S. Krishnan, S. Reid, and G. Tere- janu, “Quantifying influencer impact on affective polarization,” in 2024 International Conference on Machine Learning and Applications (ICMLA), pp. 1135–1140, 2024

work page 2024

[35] [35]

A survey on large language models from concept to implementation,

C. Wang, J. Zhao, and J. Gong, “A survey on large language models from concept to implementation,”arXiv preprint arXiv:2403.18969, 2024

work page arXiv 2024

[36] [36]

Examining radiation therapy planning knowledge in large language models,

O. Ghorbani, A. Helmy, Q. J. Wu, and Y . Ge, “Examining radiation therapy planning knowledge in large language models,” inProceedings of the 16th ACM International Conference on Bioinformatics, Compu- tational Biology, and Health Informatics, pp. 1–1, 2025

work page 2025

[37] [37]

Mm- llms: Recent advances in multimodal large language models,

D. Zhang, Y . Yu, J. Dong, C. Li, D. Su, C. Chu, and D. Yu, “Mm- llms: Recent advances in multimodal large language models,”Findings of the Association for Computational Linguistics: ACL 2024, pp. 12401– 12430, 2024

work page 2024

[38] [38]

A survey on multimodal large language models,

S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,”National Science Review, vol. 11, no. 12, p. nwae403, 2024

work page 2024

[39] [39]

A systematic review of multi-modal large language models on domain-specific applications,

S. Li, K. W. Wong, G. Wang, and T.-T. Duong, “A systematic review of multi-modal large language models on domain-specific applications,” Artificial Intelligence Review, vol. 58, no. 12, pp. 1–47, 2025

work page 2025

[40] [40]

Pre-trained video generative models as world simulators,

H. He, Y . Zhang, L. Lin, Z. Xu, and L. Pan, “Pre-trained video generative models as world simulators,”arXiv preprint arXiv:2502.07825, 2025

work page arXiv 2025

[41] [41]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V . Birodkar, J. Yan, M.-C. Chiu,et al., “Videopoet: A large language model for zero-shot video generation,”arXiv preprint arXiv:2312.14125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

Make-A-Video: Text-to-Video Generation without Text-Video Data

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni,et al., “Make-a-video: Text-to-video generation without text-video data,”arXiv preprint arXiv:2209.14792, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[43] [43]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang,et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford,et al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

Video-bench: Human-aligned video generation benchmark,

H. Han, S. Li, J. Chen, Y . Yuan, Y . Wu, Y . Deng, C. T. Leong, H. Du, J. Fu, Y . Li,et al., “Video-bench: Human-aligned video generation benchmark,” inProceedings of the Computer Vision and Pattern Recog- nition Conference, pp. 18858–18868, 2025

work page 2025

[46] [46]

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

D. Zheng, Z. Huang, H. Liu, K. Zou, Y . He, F. Zhang, L. Gu, Y . Zhang, J. He, W.-S. Zheng,et al., “Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness,”arXiv preprint arXiv:2503.21755, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,

C. Fu, Y . Dai, Y . Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y . Shen, M. Zhang,et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 24108–24118, 2025

work page 2025

[48] [48]

Tempcompass: Do video llms really understand videos?,

Y . Liu, S. Li, Y . Liu, Y . Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou, “Tempcompass: Do video llms really understand videos?,” in Findings of the Association for Computational Linguistics: ACL 2024, pp. 8731–8772, 2024

work page 2024

[49] [49]

Ucvl: a benchmark for crime surveillance video analysis with large models,

H. Chen, D. Yi, M. Cao, C. Huang, G. Zhu, and J. Wang, “Ucvl: a benchmark for crime surveillance video analysis with large models,” in2025 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA), pp. 2051–2057, IEEE, 2025

work page 2051