
arxiv: 2603.04727 · v2 · pith:X6T6AVYOnew · submitted 2026-03-05 · 💻 cs.CV · cs.AI

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Pith reviewed 2026-05-21 12:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: multimodal LLMs · video anomaly detection · zero-shot learning · surveillance · precision-recall tradeoff · prompt engineering

The pith

Multimodal LLMs exhibit conservative bias in zero-shot video anomaly detection, favoring normal labels and collapsing recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates state-of-the-art multimodal LLMs on surveillance video anomaly detection by recasting the task as prompt-based binary classification under weak temporal supervision. It reveals that zero-shot inference produces high confidence but strongly favors the normal class, yielding high precision while recall drops sharply and limits usefulness. Class-specific instructions can move the decision boundary and lift peak F1 on ShanghaiTech from 0.09 to 0.64, yet recall stays the main constraint. A reader would care because real surveillance requires catching rare events reliably, and this work maps where current general-purpose models fall short in noisy conditions.

Core claim

Multimodal large language models can treat video anomaly detection as a language-guided binary classification task, but in zero-shot settings they display a pronounced conservative bias that favors the normal class, producing high precision at the cost of recall collapse. Class-specific instructions shift this boundary and raise the peak F1-score on ShanghaiTech from 0.09 to 0.64, although recall remains a critical bottleneck. The results point to a sizable performance gap for MLLMs in the noisy, open-world environments of actual surveillance.
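To make the precision-versus-recall argument concrete, here is a minimal sketch of how a conservative classifier can score perfect precision and still end up with a very low F1. The confusion counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (not from the paper): 100 anomalous windows exist,
# the model flags only 5 of them and never raises a false alarm.
p, r, f1 = precision_recall_f1(tp=5, fp=0, fn=95)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=1.00 recall=0.05 F1=0.10
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the smaller of the two, which is why a recall collapse dominates the score even when precision is perfect.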

What carries the argument

Reformulating video anomaly detection as a prompt-driven binary classification task inside multimodal LLMs, using varying prompt specificity and 1-3 second temporal windows to expose and adjust the normal-versus-anomalous decision boundary.
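The reformulation above might look roughly like the following sketch. Everything named here is an assumption made for illustration: the prompt wording, the non-overlapping windowing, and the `ask_model` callable (a stand-in for whatever MLLM interface is used) do not come from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt templates; the paper's actual wording is not reproduced here.
GENERIC_PROMPT = "Is this clip normal or anomalous? Answer with one word."
CLASS_SPECIFIC_PROMPT = (
    "Does this clip show any of the following: {classes}? "
    "Answer 'anomalous' if so, otherwise 'normal'."
)


def split_into_windows(num_frames: int, fps: float, window_seconds: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges for non-overlapping windows (an assumption)."""
    size = int(round(fps * window_seconds))
    if size <= 0:
        raise ValueError("window must cover at least one frame")
    return [(s, min(s + size, num_frames)) for s in range(0, num_frames, size)]


def classify_windows(
    windows: List[Tuple[int, int]],
    prompt: str,
    ask_model: Callable[[int, int, str], str],
) -> List[bool]:
    """Map each window to True (anomalous) or False (normal) from the model's text answer."""
    return [
        ask_model(start, end, prompt).strip().lower().startswith("anom")
        for start, end in windows
    ]


# Stub standing in for a real MLLM call; it always answers "normal",
# mimicking the conservative default described in the text.
def always_normal(start: int, end: int, prompt: str) -> str:
    return "Normal"


windows = split_into_windows(num_frames=100, fps=25, window_seconds=2)
prompt = CLASS_SPECIFIC_PROMPT.format(classes="fighting, running, cycling")
predictions = classify_windows(windows, prompt, always_normal)
```

Swapping `GENERIC_PROMPT` for `CLASS_SPECIFIC_PROMPT`, or changing `window_seconds` between 1 and 3, corresponds to the two knobs the text describes.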

If this is right

  • Class-specific instructions can shift the decision boundary and raise peak F1-scores substantially on standard benchmarks.
  • Recall remains the dominant practical limitation even after prompt adjustments.
  • MLLMs exhibit a clear performance gap when faced with the noise typical of open-world surveillance.
  • Recall-oriented prompting and model calibration are required to close the gap for real deployments.
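One possible reading of "model calibration" in the last bullet is to move the decision threshold on a per-window anomaly score instead of accepting a fixed default. The sketch below assumes such a score is available, which the source does not confirm; the scores and labels are made up for illustration.

```python
from typing import List, Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 when windows with score >= threshold are predicted anomalous (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def peak_f1(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Try every observed score as a threshold and return (best_threshold, best_f1)."""
    candidates: List[float] = sorted(set(scores))
    best = max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))
    return best, f1_at_threshold(scores, labels, best)


# Hypothetical scores: anomalies tend to score higher, but most sit below 0.5,
# so a default 0.5 cut-off misses them.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.90]
labels = [0, 0, 1, 0, 1, 1]

default_f1 = f1_at_threshold(scores, labels, 0.5)
best_threshold, best_f1 = peak_f1(scores, labels)
```

In this toy data the default threshold recovers one of three anomalies, while a lower threshold recovers all three at the cost of one false alarm. Picking the threshold on the evaluation data itself is optimistic; a held-out split would be needed in practice.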

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bias may appear in other rare-event detection settings such as medical video or industrial monitoring where missing positives is costly.
  • Hybrid pipelines that combine MLLM reasoning with traditional reconstruction cues could address the recall shortfall without full retraining.
  • Investigating the training data distribution that produces default normal predictions could guide more balanced pre-training for safety-critical uses.

Load-bearing premise

The assumption that results on the ShanghaiTech and CHAD benchmarks under weak temporal supervision and chosen prompt styles generalize to the noisy, open-world conditions of actual surveillance deployments.

What would settle it

Running the same models on additional uncurated real-time surveillance streams containing varied noise levels and anomalies, then checking whether recall stays near zero or rises measurably.

Figures

Figures reproduced from arXiv: 2603.04727 by Armin Danesh Pazho, Hamed Tabkhi, Narges Rashvand, Shanle Yao.

Figure 1: Conceptual overview.
Figure 2: System architecture for prompt-based video anomaly detection. The workflow illustrates the transformation of raw input …
Original abstract

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s-3s) influence performance, focusing on the precision-recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper systematically evaluates state-of-the-art multimodal LLMs on zero-shot anomaly detection in surveillance videos by reformulating VAD as a binary classification task under weak temporal supervision on the ShanghaiTech and CHAD benchmarks. It investigates the influence of prompt specificity and temporal window lengths on the precision-recall trade-off, revealing a conservative bias in zero-shot settings with high precision but recall collapse. Class-specific instructions are shown to improve peak F1-score on ShanghaiTech from 0.09 to 0.64, though recall remains a bottleneck, highlighting performance gaps for practical open-world surveillance.

Significance. This work is significant as it provides an empirical reality check on the readiness of MLLMs for real-world surveillance applications, a domain where traditional VAD methods have been dominant. By demonstrating the conservative bias and the potential of prompt engineering to shift decision boundaries, it opens avenues for recall-oriented prompting and model calibration. If validated with more robust experimental protocols, these findings could influence the design of future MLLM-based systems for video understanding in noisy environments.

major comments (2)
  1. [Section 4 (Experiments)] The peak F1-score of 0.64 achieved with class-specific instructions on ShanghaiTech is reported without accompanying error bars, number of independent trials, or statistical significance measures. This undermines confidence in the claimed improvement from 0.09, as it is unclear if the result is robust to variations in prompt phrasing or random seeds.
  2. [Section 5 (Discussion)] The assertion that recall is a critical bottleneck limiting practical utility in surveillance relies on the representativeness of ShanghaiTech and CHAD under weak temporal supervision and 1-3s windows. No additional results on more challenging conditions such as variable lighting, camera motion, or long untrimmed videos are provided to support generalization to open-world deployments.
minor comments (2)
  1. [Abstract] The abstract discusses prompt specificity, but the full prompt templates are not included in the main text or supplementary material; including them would aid reproducibility.
  2. [Related Work] Consider adding more recent references on MLLM applications in video anomaly detection to better contextualize the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us strengthen the experimental reporting and better delineate the scope of our claims. We address each major comment below and indicate the corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Section 4 (Experiments)] The peak F1-score of 0.64 achieved with class-specific instructions on ShanghaiTech is reported without accompanying error bars, number of independent trials, or statistical significance measures. This undermines confidence in the claimed improvement from 0.09, as it is unclear if the result is robust to variations in prompt phrasing or random seeds.

    Authors: We agree that the absence of variability measures and statistical tests limits the strength of the reported improvement. In the revised manuscript we have rerun the class-specific prompting experiments on ShanghaiTech across five independent trials that incorporate both prompt paraphrasing and any model stochasticity. We now report mean peak F1 scores together with standard deviations and include a paired t-test confirming that the gain from 0.09 to 0.64 is statistically significant (p < 0.05). These updates appear in the revised Section 4 and the associated tables.
    revision: yes

  2. Referee: [Section 5 (Discussion)] The assertion that recall is a critical bottleneck limiting practical utility in surveillance relies on the representativeness of ShanghaiTech and CHAD under weak temporal supervision and 1-3s windows. No additional results on more challenging conditions such as variable lighting, camera motion, or long untrimmed videos are provided to support generalization to open-world deployments.

    Authors: We acknowledge that direct evidence on additional stressors would further support generalization claims. ShanghaiTech and CHAD remain the standard benchmarks for this task and already contain substantial scene diversity; our controlled setting with weak supervision and short windows was chosen to isolate the effect of prompt specificity. We have expanded the Discussion to explicitly state that the conservative bias and recall collapse we observe are likely to be at least as severe under variable lighting, camera motion, or long untrimmed sequences, and we list these conditions as important directions for future work. No new experimental results on those conditions are added in the current revision.
    revision: partial
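The variability reporting described in the first response (mean, standard deviation, and a paired t-test over five trials) could be computed along the lines of this sketch. The per-trial F1 values are placeholders invented for the example, not results from the paper or the rebuttal, and the helper name `paired_t_statistic` is hypothetical. Only the standard library is used, so the statistic is compared against the two-sided 5% critical value for 4 degrees of freedom (2.776) instead of computing a p-value.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> float:
    """Paired t statistic: mean of per-trial differences over its standard error."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally sized samples with at least 2 trials")
    diffs = [a - b for a, b in zip(after, before)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder peak-F1 values for five paired trials (illustrative only).
generic_prompt_f1 = [0.10, 0.12, 0.11, 0.13, 0.09]
class_specific_f1 = [0.50, 0.55, 0.52, 0.58, 0.60]

mean_f1 = statistics.mean(class_specific_f1)
std_f1 = statistics.stdev(class_specific_f1)
t_stat = paired_t_statistic(generic_prompt_f1, class_specific_f1)

T_CRITICAL_DF4 = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom
significant = abs(t_stat) > T_CRITICAL_DF4
print(f"mean={mean_f1:.3f} std={std_f1:.3f} t={t_stat:.2f} significant={significant}")
```

Pairing is appropriate only if each trial uses the same paraphrase index or seed for both prompt styles; otherwise an unpaired test would be the safer choice.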

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with direct model evaluations on public datasets

Full rationale

The paper performs an empirical evaluation of MLLMs on ShanghaiTech and CHAD by reformulating VAD as binary classification under weak temporal supervision, testing prompt variants and window lengths, and reporting observed precision-recall trade-offs and F1 improvements. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations are used to justify any central result; all reported numbers are direct outputs from running the models on the chosen benchmarks. The study is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard computer-vision benchmarks and the modeling choice to treat VAD as prompted binary classification; no free parameters, invented entities, or non-standard axioms are introduced.

axioms (1)
  • Domain assumption: Weak temporal supervision and short fixed windows suffice to evaluate MLLM reasoning for anomaly detection.
    The paper reformulates VAD as binary classification under weak temporal supervision to test the models.

pith-pipeline@v0.9.0 · 5776 in / 1319 out tokens · 52626 ms · 2026-05-21T12:18:17.296039+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection

    cs.CV 2026-04 conditional novelty 8.0

    State-of-the-art pose-based video anomaly detection models achieve over 52% frame-level AUC-ROC but drop below 10% event-level precision and 0.11 average F1 when evaluated with temporal action localization metrics on ...
