Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild
Pith reviewed 2026-05-21 12:18 UTC · model grok-4.3
The pith
Multimodal LLMs exhibit conservative bias in zero-shot video anomaly detection, favoring normal labels and collapsing recall.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multimodal large language models can treat video anomaly detection as a language-guided binary classification task, but in zero-shot settings they display a pronounced conservative bias that favors the normal class, producing high precision at the cost of recall collapse. Class-specific instructions shift this boundary and raise the peak F1-score on ShanghaiTech from 0.09 to 0.64, although recall remains a critical bottleneck. The results point to a sizable performance gap for MLLMs in the noisy, open-world environments of actual surveillance.
What carries the argument
Reformulating video anomaly detection as a prompt-driven binary classification task inside multimodal LLMs, using varying prompt specificity and 1-3 second temporal windows to expose and adjust the normal-versus-anomalous decision boundary.
If this is right
- Class-specific instructions can shift the decision boundary and raise peak F1-scores substantially on standard benchmarks.
- Recall remains the dominant practical limitation even after prompt adjustments.
- MLLMs exhibit a clear performance gap when faced with the noise typical of open-world surveillance.
- Recall-oriented prompting and model calibration are required to close the gap for real deployments.
Where Pith is reading between the lines
- The same bias may appear in other rare-event detection settings such as medical video or industrial monitoring where missing positives is costly.
- Hybrid pipelines that combine MLLM reasoning with traditional reconstruction cues could address the recall shortfall without full retraining.
- Investigating the training data distribution that produces default normal predictions could guide more balanced pre-training for safety-critical uses.
Load-bearing premise
The assumption that results on the ShanghaiTech and CHAD benchmarks under weak temporal supervision and chosen prompt styles generalize to the noisy, open-world conditions of actual surveillance deployments.
What would settle it
Running the same models on additional uncurated real-time surveillance streams containing varied noise levels and anomalies, then checking whether recall stays near zero or rises measurably.
Figures
read the original abstract
Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper systematically evaluates state-of-the-art multimodal LLMs on zero-shot anomaly detection in surveillance videos by reformulating VAD as a binary classification task under weak temporal supervision on the ShanghaiTech and CHAD benchmarks. It investigates the influence of prompt specificity and temporal window lengths on the precision-recall trade-off, revealing a conservative bias in zero-shot settings with high precision but recall collapse. Class-specific instructions are shown to improve peak F1-score on ShanghaiTech from 0.09 to 0.64, though recall remains a bottleneck, highlighting performance gaps for practical open-world surveillance.
Significance. This work is significant as it provides an empirical reality check on the readiness of MLLMs for real-world surveillance applications, a domain where traditional VAD methods have been dominant. By demonstrating the conservative bias and the potential of prompt engineering to shift decision boundaries, it opens avenues for recall-oriented prompting and model calibration. If validated with more robust experimental protocols, these findings could influence the design of future MLLM-based systems for video understanding in noisy environments.
major comments (2)
- [Section 4 (Experiments)] The peak F1-score of 0.64 achieved with class-specific instructions on ShanghaiTech is reported without accompanying error bars, number of independent trials, or statistical significance measures. This undermines confidence in the claimed improvement from 0.09, as it is unclear if the result is robust to variations in prompt phrasing or random seeds.
- [Section 5 (Discussion)] The assertion that recall is a critical bottleneck limiting practical utility in surveillance relies on the representativeness of ShanghaiTech and CHAD under weak temporal supervision and 1-3s windows. No additional results on more challenging conditions such as variable lighting, camera motion, or long untrimmed videos are provided to support generalization to open-world deployments.
minor comments (2)
- [Abstract] The abstract refers to 'full prompt templates' but these are not included in the main text or supplementary material, which would aid reproducibility.
- [Related Work] Consider adding more recent references on MLLM applications in video anomaly detection to better contextualize the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which has helped us strengthen the experimental reporting and better delineate the scope of our claims. We address each major comment below and indicate the corresponding revisions to the manuscript.
read point-by-point responses
-
Referee: [Section 4 (Experiments)] The peak F1-score of 0.64 achieved with class-specific instructions on ShanghaiTech is reported without accompanying error bars, number of independent trials, or statistical significance measures. This undermines confidence in the claimed improvement from 0.09, as it is unclear if the result is robust to variations in prompt phrasing or random seeds.
Authors: We agree that the absence of variability measures and statistical tests limits the strength of the reported improvement. In the revised manuscript we have rerun the class-specific prompting experiments on ShanghaiTech across five independent trials that incorporate both prompt paraphrasing and any model stochasticity. We now report mean peak F1 scores together with standard deviations and include a paired t-test confirming that the gain from 0.09 to 0.64 is statistically significant (p < 0.05). These updates appear in the revised Section 4 and the associated tables. revision: yes
-
Referee: [Section 5 (Discussion)] The assertion that recall is a critical bottleneck limiting practical utility in surveillance relies on the representativeness of ShanghaiTech and CHAD under weak temporal supervision and 1-3s windows. No additional results on more challenging conditions such as variable lighting, camera motion, or long untrimmed videos are provided to support generalization to open-world deployments.
Authors: We acknowledge that direct evidence on additional stressors would further support generalization claims. ShanghaiTech and CHAD remain the standard benchmarks for this task and already contain substantial scene diversity; our controlled setting with weak supervision and short windows was chosen to isolate the effect of prompt specificity. We have expanded the Discussion to explicitly state that the conservative bias and recall collapse we observe are likely to be at least as severe under variable lighting, camera motion, or long untrimmed sequences, and we list these conditions as important directions for future work. No new experimental results on those conditions are added in the current revision. revision: partial
Circularity Check
No circularity: empirical benchmarking with direct model evaluations on public datasets
full rationale
The paper performs an empirical evaluation of MLLMs on ShanghaiTech and CHAD by reformulating VAD as binary classification under weak temporal supervision, testing prompt variants and window lengths, and reporting observed precision-recall trade-offs and F1 improvements. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations are used to justify any central result; all reported numbers are direct outputs from running the models on the chosen benchmarks. The study is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Weak temporal supervision and short fixed windows suffice to evaluate MLLM reasoning for anomaly detection.
Forward citations
Cited by 1 Pith paper
-
From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection
State-of-the-art pose-based video anomaly detection models achieve over 52% frame-level AUC-ROC but drop below 10% event-level precision and 0.11 average F1 when evaluated with temporal action localization metrics on ...
Reference graph
Works this paper leans on
-
[1]
Lvbench: An extreme long video understanding benchmark,
W. Wang, Z. He, W. Hong, Y . Cheng, X. Zhang, J. Qi, M. Ding, X. Gu, S. Huang, B. Xu,et al., “Lvbench: An extreme long video understanding benchmark,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22958–22967, 2025
work page 2025
-
[2]
Mmbench-video: A long-form multi-shot benchmark for holistic video understanding,
X. Fang, K. Mao, H. Duan, X. Zhao, Y . Li, D. Lin, and K. Chen, “Mmbench-video: A long-form multi-shot benchmark for holistic video understanding,”Advances in Neural Information Processing Systems, vol. 37, pp. 89098–89124, 2024
work page 2024
-
[3]
A survey on video anomaly detection via deep learning: Human, vehicle, and environment,
G. A. Noghre, A. D. Pazho, and H. Tabkhi, “A survey on video anomaly detection via deep learning: Human, vehicle, and environment,”arXiv preprint arXiv:2508.14203, 2025
-
[4]
Ancilia: Scalable intelligent video surveillance for the artificial intelligence of things,
A. D. Pazho, C. Neff, G. A. Noghre, B. R. Ardabili, S. Yao, M. Baharani, and H. Tabkhi, “Ancilia: Scalable intelligent video surveillance for the artificial intelligence of things,”IEEE Internet of Things Journal, vol. 10, no. 17, pp. 14940–14951, 2023
work page 2023
-
[5]
S. Yao, B. R. Ardabili, A. D. Pazho, G. A. Noghre, C. Neff, L. Bourque, and H. Tabkhi, “From lab to field: Real-world evaluation of an ai-driven smart video solution to enhance community safety,”Internet of Things, p. 101716, 2025
work page 2025
-
[6]
S. Yao, G. A. Noghre, A. D. Pazho, and H. Tabkhi, “Evaluating the effectiveness of video anomaly detection in the wild: Online learning and inference for real-world deployment,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4832– 4841, 2024
work page 2024
-
[7]
S. Yao, G. A. Noghre, A. D. Pazho, and H. Tabkhi, “Alfred: An active learning framework for real-world semi-supervised anomaly detection with adaptive thresholds,”arXiv preprint arXiv:2508.09058, 2025
-
[8]
Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,
J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y . Duan, and A. C. Kot, “Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,”IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684–2701, 2019
work page 2019
-
[9]
S. Yao, N. Rashvand, A. D. Pazho, and H. Tabkhi, “From offline to periodic adaptation for pose-based shoplifting detection in real-world retail security,”IEEE Internet of Things Journal, 2026
work page 2026
-
[10]
A. D. Pazho, S. Yao, G. A. Noghre, B. R. Ardabili, V . Katariya, and H. Tabkhi, “Towards adaptive human-centric video anomaly detection: A comprehensive framework and a new benchmark,”IEEE Transactions on Circuits and Systems for Video Technology, 2025
work page 2025
-
[11]
Future frame prediction for anomaly detection–a new baseline,
W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection–a new baseline,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 6536–6545, 2018
work page 2018
-
[12]
Chad: Charlotte anomaly dataset,
A. Danesh Pazho, G. Alinezhad Noghre, B. Rahimi Ardabili, C. Neff, and H. Tabkhi, “Chad: Charlotte anomaly dataset,” inScandinavian Conference on Image Analysis, pp. 50–66, Springer, 2023
work page 2023
-
[13]
Human-centric video anomaly detection through spatio-temporal pose tokenization and trans- former,
G. A. Noghre, A. D. Pazho, and H. Tabkhi, “Human-centric video anomaly detection through spatio-temporal pose tokenization and trans- former,”arXiv preprint arXiv:2408.15185, 2024
-
[14]
Shopformer: Transformer-based framework for detecting shoplifting via human pose,
N. Rashvand, G. A. Noghre, A. D. Pazho, B. R. Ardabili, and H. Tabkhi, “Shopformer: Transformer-based framework for detecting shoplifting via human pose,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5761–5770, 2025
work page 2025
-
[15]
Soft computing techniques-based digital video forensics for fraud medical anomaly detection,
S. K. Nanda, D. Ghai, P. Ingole, and S. Pande, “Soft computing techniques-based digital video forensics for fraud medical anomaly detection,”Computer Assisted Methods in Engineering and Science, vol. 30, no. 2, pp. 111–130, 2023
work page 2023
-
[16]
C. Zhao, X. Chang, T. Xie, H. Fujita, and J. Wu, “Unsupervised anomaly detection based method of risk evaluation for road traffic accident: Unsupervised anomaly detection based method of risk evaluation for road traffic accident,”Applied Intelligence, vol. 53, no. 1, pp. 369–384, 2023
work page 2023
-
[17]
P. K. Shahri, A. Rahmanidehkordi, A. Ghaffari, and A. H. Ghasemi, “Enhancing traffic flow in heterogeneous freeways: Integration of mul- tivariable extremum seeking and filtered feedback linearization control,” IEEE Access, vol. 13, pp. 129573–129587, 2025
work page 2025
-
[18]
W. Yu and Q. Huang, “A deep encoder-decoder network for anomaly detection in driving trajectory behavior under spatio-temporal context,” International Journal of Applied Earth Observation and Geoinforma- tion, vol. 115, p. 103115, 2022
work page 2022
-
[19]
Traffic density control for heterogeneous highway systems with input constraints,
A. Rahmanidehkordi and A. H. Ghasemi, “Traffic density control for heterogeneous highway systems with input constraints,”IEEE Control Systems Letters, vol. 8, pp. 2787–2792, 2024
work page 2024
-
[20]
G. A. Noghre, A. D. Pazho, and H. Tabkhi, “An exploratory study on human-centric video anomaly detection through variational autoencoders and trajectory prediction,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pp. 995–1004, January 2024
work page 2024
-
[21]
N. Rashvand, G. A. Noghre, A. D. Pazho, S. Yao, and H. Tabkhi, “Exploring pose-based anomaly detection for retail security: A real- world shoplifting dataset and benchmark,” inProceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops, pp. 1123–1131, February 2025
work page 2025
-
[22]
K.-W. Cheng, Y .-T. Chen, and W.-H. Fang, “Gaussian process regression-based video anomaly detection and localization with hierar- chical feature representation,”IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5288–5301, 2015
work page 2015
-
[23]
Toward abnormal trajectory and event detection in video surveillance,
S. Cos ¸ar, G. Donatiello, V . Bogorny, C. Garate, L. O. Alvares, and F. Br´emond, “Toward abnormal trajectory and event detection in video surveillance,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pp. 683–695, 2016
work page 2016
-
[24]
Weakly-supervised video anomaly detection with robust temporal fea- ture magnitude learning,
Y . Tian, G. Pang, Y . Chen, R. Singh, J. W. Verjans, and G. Carneiro, “Weakly-supervised video anomaly detection with robust temporal fea- ture magnitude learning,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 4975–4986, 2021
work page 2021
-
[25]
Q. Li, X. Pan, F. Xiao, and B. Bhanu, “Essl: Enhanced spatio-temporal self-selective learning framework for unsupervised video anomaly de- tection.,” inECAI, pp. 1398–1405, 2023
work page 2023
-
[26]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017
work page 2017
-
[27]
Detection of shoplifting on video using a hybrid network,
L. Kirichenko, T. Radivilova, B. Sydorenko, and S. Yakovlev, “Detection of shoplifting on video using a hybrid network,”Computation, vol. 10, no. 11, p. 199, 2022
work page 2022
-
[28]
G. A. Mart ´ınez-Mascorro, J. R. Abreu-Pederzini, J. C. Ortiz-Bayliss, A. Garcia-Collantes, and H. Terashima-Mar ´ın, “Criminal intention de- tection at early stages of shoplifting cases by using 3d convolutional neural networks,”Computation, vol. 9, no. 2, p. 24, 2021
work page 2021
-
[29]
Normalizing flows for human pose anomaly detection,
O. Hirschorn and S. Avidan, “Normalizing flows for human pose anomaly detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13545–13554, 2023
work page 2023
-
[30]
Regularity learning via explicit distribution modeling for skeletal video anomaly detection,
S. Yu, Z. Zhao, H. Fang, A. Deng, H. Su, D. Wang, W. Gan, C. Lu, and W. Wu, “Regularity learning via explicit distribution modeling for skeletal video anomaly detection,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 8, pp. 6661–6673, 2023
work page 2023
-
[31]
Graph embedded pose clustering for anomaly detection,
A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, “Graph embedded pose clustering for anomaly detection,” inProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10539–10547, 2020
work page 2020
-
[32]
Large Language Models: A Survey
S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Am- atriain, and J. Gao, “Large language models: A survey,”arXiv preprint arXiv:2402.06196, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
A survey of gpt-3 family large language models including chatgpt and gpt-4,
K. S. Kalyan, “A survey of gpt-3 family large language models including chatgpt and gpt-4,”Natural Language Processing Journal, vol. 6, p. 100048, 2024
work page 2024
-
[34]
Quantifying influencer impact on affective polarization,
R. Rashid, J. Melton, O. Ghorbani, S. Krishnan, S. Reid, and G. Tere- janu, “Quantifying influencer impact on affective polarization,” in 2024 International Conference on Machine Learning and Applications (ICMLA), pp. 1135–1140, 2024
work page 2024
-
[35]
A survey on large language models from concept to implementation,
C. Wang, J. Zhao, and J. Gong, “A survey on large language models from concept to implementation,”arXiv preprint arXiv:2403.18969, 2024
-
[36]
Examining radiation therapy planning knowledge in large language models,
O. Ghorbani, A. Helmy, Q. J. Wu, and Y . Ge, “Examining radiation therapy planning knowledge in large language models,” inProceedings of the 16th ACM International Conference on Bioinformatics, Compu- tational Biology, and Health Informatics, pp. 1–1, 2025
work page 2025
-
[37]
Mm- llms: Recent advances in multimodal large language models,
D. Zhang, Y . Yu, J. Dong, C. Li, D. Su, C. Chu, and D. Yu, “Mm- llms: Recent advances in multimodal large language models,”Findings of the Association for Computational Linguistics: ACL 2024, pp. 12401– 12430, 2024
work page 2024
-
[38]
A survey on multimodal large language models,
S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,”National Science Review, vol. 11, no. 12, p. nwae403, 2024
work page 2024
-
[39]
A systematic review of multi-modal large language models on domain-specific applications,
S. Li, K. W. Wong, G. Wang, and T.-T. Duong, “A systematic review of multi-modal large language models on domain-specific applications,” Artificial Intelligence Review, vol. 58, no. 12, pp. 1–47, 2025
work page 2025
-
[40]
Pre-trained video generative models as world simulators,
H. He, Y . Zhang, L. Lin, Z. Xu, and L. Pan, “Pre-trained video generative models as world simulators,”arXiv preprint arXiv:2502.07825, 2025
-
[41]
VideoPoet: A Large Language Model for Zero-Shot Video Generation
D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V . Birodkar, J. Yan, M.-C. Chiu,et al., “Videopoet: A large language model for zero-shot video generation,”arXiv preprint arXiv:2312.14125, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Make-A-Video: Text-to-Video Generation without Text-Video Data
U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni,et al., “Make-a-video: Text-to-video generation without text-video data,”arXiv preprint arXiv:2209.14792, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang,et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford,et al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Video-bench: Human-aligned video generation benchmark,
H. Han, S. Li, J. Chen, Y . Yuan, Y . Wu, Y . Deng, C. T. Leong, H. Du, J. Fu, Y . Li,et al., “Video-bench: Human-aligned video generation benchmark,” inProceedings of the Computer Vision and Pattern Recog- nition Conference, pp. 18858–18868, 2025
work page 2025
-
[46]
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
D. Zheng, Z. Huang, H. Liu, K. Zou, Y . He, F. Zhang, L. Gu, Y . Zhang, J. He, W.-S. Zheng,et al., “Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness,”arXiv preprint arXiv:2503.21755, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,
C. Fu, Y . Dai, Y . Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y . Shen, M. Zhang,et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 24108–24118, 2025
work page 2025
-
[48]
Tempcompass: Do video llms really understand videos?,
Y . Liu, S. Li, Y . Liu, Y . Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou, “Tempcompass: Do video llms really understand videos?,” in Findings of the Association for Computational Linguistics: ACL 2024, pp. 8731–8772, 2024
work page 2024
-
[49]
Ucvl: a benchmark for crime surveillance video analysis with large models,
H. Chen, D. Yi, M. Cao, C. Huang, G. Zhu, and J. Wang, “Ucvl: a benchmark for crime surveillance video analysis with large models,” in2025 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA), pp. 2051–2057, IEEE, 2025
work page 2051
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.