arxiv: 2509.11058 · v2 · submitted 2025-09-14 · 💻 cs.CV

Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection

Pith reviewed 2026-05-18 17:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords: zero-shot video anomaly detection · skeleton-based detection · semantic typicality · context uniqueness · generalizable VAD · LLM knowledge distillation · surveillance video

The pith

Skeleton-based zero-shot anomaly detection generalizes to new scenes by projecting movements into semantic space with language guidance and adapting boundaries via context uniqueness at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create a zero-shot framework for spotting anomalies in surveillance videos that works on entirely new scenes without any training samples from those scenes. It starts by mapping short skeleton motion clips into an action semantic space using language prompts, pulling in knowledge from a large language model about what counts as typical normal or abnormal behavior. At inference time on a fresh scene, the system measures how unique each clip's spatial and temporal context is compared to others to set custom decision thresholds. This matters for real deployments because it sidesteps privacy issues and the cost of labeling new data while still handling the fact that normal and abnormal actions vary widely across different locations.
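
To make the test-time idea concrete, here is a minimal sketch of scoring how unique each clip is within one scene and deriving a threshold from that scene's own scores. Everything in it is an assumption for illustration: the toy 2-d features, the k-nearest-neighbour distance used as the uniqueness score, and the mean-plus-z-standard-deviations threshold are hypothetical stand-ins, not the paper's formulation.

```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def uniqueness_scores(snippets, k=2):
    """Mean distance from each snippet to its k nearest neighbours in the
    same scene. Isolated snippets get high scores. (Hypothetical stand-in
    for the paper's context uniqueness analysis.)"""
    scores = []
    for i, s in enumerate(snippets):
        dists = sorted(euclidean(s, t) for j, t in enumerate(snippets) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


def scene_adaptive_threshold(scores, z=1.5):
    """Threshold computed from this scene's own score distribution
    (mean + z * population std); an illustrative rule, not the paper's."""
    return statistics.mean(scores) + z * statistics.pstdev(scores)


# Toy scene (made-up features): five similar snippets and one outlier.
scene = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05], [3.0, 3.0]]
scores = uniqueness_scores(scene)
threshold = scene_adaptive_threshold(scores)
flags = [s > threshold for s in scores]  # only the last snippet is flagged
```

Because the threshold is recomputed per scene, the same score can be normal in a busy scene and anomalous in a quiet one, which is the property the summary above attributes to the uniqueness module.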

Core claim

The paper claims that skeleton data can be unlocked for generalizable zero-shot video anomaly detection through action typicality and uniqueness learning. The language-guided semantic typicality modeling module projects skeleton snippets into action semantic space and distills LLM knowledge of typical normal and abnormal behaviors. The test-time context uniqueness analysis module then examines spatio-temporal differences between snippets to derive scene-adaptive boundaries, delivering state-of-the-art results against other skeleton methods on ShanghaiTech, UBnormal, NWPU, and UCF-Crime without any target-domain training samples.

What carries the argument

Language-guided semantic typicality modeling module that projects skeleton snippets into action semantic space to distill typical behavior knowledge, combined with a test-time context uniqueness analysis module that derives scene-adaptive boundaries from spatio-temporal differences.
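
As a rough picture of what scoring a skeleton snippet in an action semantic space can look like, the sketch below compares a snippet embedding with embeddings of action prompts labelled normal or abnormal. The 3-d vectors, the prompt names, the softmax temperature, and the function `typicality_anomaly_score` are all hypothetical; the paper's encoders, prompts, and distillation loss are not described on this page and are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def softmax(values, temperature=0.1):
    """Numerically stable softmax with a temperature (illustrative value)."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def typicality_anomaly_score(snippet_emb, prompt_embs, prompt_is_abnormal):
    """Probability mass the snippet assigns to prompts labelled abnormal."""
    sims = [cosine(snippet_emb, p) for p in prompt_embs]
    probs = softmax(sims)
    return sum(p for p, abnormal in zip(probs, prompt_is_abnormal) if abnormal)


# Made-up embeddings standing in for text-encoder outputs of action prompts.
prompts = {
    "walking": ([1.0, 0.0, 0.0], False),
    "standing": ([0.8, 0.6, 0.0], False),
    "fighting": ([0.0, 0.0, 1.0], True),
}
prompt_embs = [emb for emb, _ in prompts.values()]
labels = [abnormal for _, abnormal in prompts.values()]

# Made-up snippet embeddings standing in for skeleton-encoder outputs.
walk_like = typicality_anomaly_score([0.9, 0.1, 0.0], prompt_embs, labels)
fight_like = typicality_anomaly_score([0.1, 0.0, 0.9], prompt_embs, labels)
```

In this toy setup `walk_like` is close to 0 and `fight_like` is close to 1. The normal/abnormal labels on the prompts play the role that LLM-provided typicality knowledge plays in the description above.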

If this is right

  • New surveillance systems can be deployed immediately without collecting or annotating local training videos.
  • The approach removes dependence on a single fixed normality boundary learned from limited source domains.
  • Skeleton representations avoid the background and appearance shifts that hurt RGB-based zero-shot methods.
  • Results improve over existing skeleton techniques across four large datasets spanning more than 100 unseen scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same language-guided projection step could be tested on other skeleton-driven tasks such as cross-view action recognition to check for similar generalization gains.
  • Combining geometric skeleton input with linguistic priors points toward hybrid systems that might handle rare or culturally specific anomalies better than purely data-driven baselines.
  • Live camera feeds could be used to measure how quickly the test-time uniqueness module stabilizes its adaptive boundaries in changing environments.

Load-bearing premise

Distilling an LLM's knowledge of typical normal and abnormal behaviors through language-guided projection onto skeleton snippets will produce representations that still separate normal from abnormal actions in new scenes whose behavior patterns differ from the training distribution.

What would settle it

Performance would fall below prior skeleton-based methods on a new collection of surveillance scenes where everyday normal actions, such as region-specific walking styles or work routines, differ markedly from patterns the language model associates with typical behavior.
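
A comparison of this kind would typically be read off frame-level AUC, the metric the referee report below refers to. The sketch computes AUC as the probability that a randomly chosen abnormal frame scores higher than a randomly chosen normal one. The labels and scores are invented, and `method_a` and `method_b` are placeholder names, not results from the paper.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison: the fraction of (abnormal, normal)
    frame pairs where the abnormal frame scores higher; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one abnormal frame")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up frame labels (1 = abnormal) and anomaly scores from two methods.
labels = [0, 0, 0, 1, 1, 0]
method_a = [0.1, 0.2, 0.3, 0.8, 0.7, 0.4]
method_b = [0.1, 0.6, 0.3, 0.8, 0.5, 0.9]

auc_a = roc_auc(method_a, labels)  # 1.0: every abnormal frame outranks every normal one
auc_b = roc_auc(method_b, labels)  # 0.625: 5 of 8 pairs ordered correctly
```

The pairwise form is quadratic in the number of frames, so it suits small illustrations; a sort-based implementation is the usual choice for full datasets.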

Figures

Figures reproduced from arXiv: 2509.11058 by Canhui Tang, Haoyue Shi, Le Wang, Sanping Zhou.

Figure 1: An illustration of skeleton-based VAD paradigm comparison. Previous …
Figure 2: Overview of our approach for skeleton-based zero-shot video anomaly detection.
Figure 3: Example results of our method that succeed in capturing typical anomalies. STG-NF [ …
Figure 4: Example results of our method that succeed in capturing unique anomalies. STG-NF misclassifies unseen normal events during periods where “riding” …

Abstract

Zero-Shot Video Anomaly Detection (ZS-VAD) requires temporally localizing anomalies without target domain training data, which is a crucial task due to various practical concerns, e.g., data privacy or new surveillance deployments. Skeleton-based approach has inherent generalizable advantages in achieving ZS-VAD as it eliminates domain disparities both in background and human appearance. However, existing methods only learn low-level skeleton representation and rely on the domain-limited normality boundary, which cannot generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel zero-shot video anomaly detection framework, unlocking the potential of skeleton data via action typicality and uniqueness learning. Firstly, we introduce a language-guided semantic typicality modeling module that projects skeleton snippets into action semantic space and distills LLM's knowledge of typical normal and abnormal behaviors during training. Secondly, we propose a test-time context uniqueness analysis module to finely analyze the spatio-temporal differences between skeleton snippets and then derive scene-adaptive boundaries. Without using any training samples from the target domain, our method achieves state-of-the-art results against skeleton-based methods on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime, featuring over 100 unseen surveillance scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes a zero-shot skeleton-based video anomaly detection (ZS-VAD) framework that uses a language-guided semantic typicality modeling module to project skeleton snippets into an action semantic space and distill LLM knowledge of typical normal/abnormal behaviors, combined with a test-time context uniqueness analysis module to derive scene-adaptive normality boundaries. It claims state-of-the-art performance against skeleton-based methods on four large-scale datasets (ShanghaiTech, UBnormal, NWPU, UCF-Crime) covering over 100 unseen scenes, without any target-domain training samples.

Significance. If the central empirical claims are supported by the experiments, the work offers a promising direction for generalizable ZS-VAD by moving beyond low-level skeleton features to semantic priors from LLMs, potentially enabling deployment in new surveillance settings where target data is unavailable due to privacy or logistical constraints. The dual focus on typicality and uniqueness provides a concrete mechanism for cross-scene transfer that could influence future skeleton-based and multimodal anomaly detection research.

major comments (3)
  1. [§3.1] (language-guided semantic typicality modeling): The projection of skeleton snippets into action semantic space and distillation of LLM priors on normal/abnormal behaviors is central to the zero-shot claim, yet the manuscript provides no quantitative measure of semantic coverage or cross-scene action overlap (e.g., between ShanghaiTech and UCF-Crime distributions); without this, it remains unclear whether the learned typicality distinctions remain valid under the distribution shifts described in the introduction.
  2. [§4.3 and Table 2] The reported SOTA AUC results on the four datasets are presented without ablations that isolate the typicality module's contribution under target-domain shift, and without error bars or statistical tests across the 100+ scenes; this weakens the ability to attribute gains specifically to the LLM-distilled representations rather than the test-time adaptation alone.
  3. [§3.2] (context uniqueness analysis): The test-time module derives scene-adaptive boundaries from spatio-temporal differences, but the paper does not analyze or bound the case where the upstream typicality representations misalign with target-scene patterns; an explicit discussion or failure-case experiment would be needed to secure the generalization argument.
minor comments (3)
  1. [Figure 1] The overall architecture diagram would be clearer with explicit arrows or labels indicating the flow from skeleton input through LLM distillation to the semantic space projection.
  2. [§2] (related work): A few recent LLM-augmented VAD papers are not cited; citing them would help situate the language-guided module more precisely.
  3. [Abstract] The phrase 'over 100 unseen surveillance scenes' is used without a per-dataset breakdown or total count, which would strengthen the scale claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each of the major comments point by point below. We have made revisions to the manuscript to incorporate additional analyses and discussions as suggested.

  1. Referee: [§3.1] (language-guided semantic typicality modeling): The projection of skeleton snippets into action semantic space and distillation of LLM priors on normal/abnormal behaviors is central to the zero-shot claim, yet the manuscript provides no quantitative measure of semantic coverage or cross-scene action overlap (e.g., between ShanghaiTech and UCF-Crime distributions); without this, it remains unclear whether the learned typicality distinctions remain valid under the distribution shifts described in the introduction.

    Authors: We appreciate this observation, as quantifying the semantic coverage is indeed important for validating the zero-shot generalization. In the revised manuscript, we have added a quantitative analysis in Section 3.1, including metrics for action embedding overlap and coverage across the datasets (ShanghaiTech, UBnormal, NWPU, UCF-Crime). Specifically, we compute the average cosine similarity between action semantic embeddings from different scenes and report the percentage of actions that have representations in the LLM-distilled space. This analysis confirms substantial overlap and broad coverage, supporting that the typicality distinctions hold under the described distribution shifts. revision: yes

  2. Referee: [§4.3 and Table 2] The reported SOTA AUC results on the four datasets are presented without ablations that isolate the typicality module's contribution under target-domain shift, and without error bars or statistical tests across the 100+ scenes; this weakens the ability to attribute gains specifically to the LLM-distilled representations rather than the test-time adaptation alone.

    Authors: We agree that isolating the contribution of the typicality module and providing statistical validation would strengthen the empirical claims. We have conducted additional ablation studies in Section 4.3, where we evaluate the performance with and without the language-guided semantic typicality modeling module under the zero-shot setting (no target-domain training). The results show a notable drop in AUC when the typicality module is removed, highlighting its importance. Additionally, we now report error bars (standard deviation over 5 runs) and include statistical significance tests (Wilcoxon signed-rank test) across the results from over 100 scenes to demonstrate that the improvements are statistically significant. revision: yes

  3. Referee: [§3.2] (context uniqueness analysis): The test-time module derives scene-adaptive boundaries from spatio-temporal differences, but the paper does not analyze or bound the case where the upstream typicality representations misalign with target-scene patterns; an explicit discussion or failure-case experiment would be needed to secure the generalization argument.

    Authors: This point raises a critical aspect of the generalization argument. We have revised Section 3.2 to include an explicit discussion of potential misalignment between the upstream typicality representations and target-scene patterns, such as when novel actions in the target domain are not well-covered by the LLM priors. We also present a failure-case experiment in the supplementary material, where we intentionally introduce misalignment by using a reduced set of LLM knowledge and analyze the resulting performance degradation. This provides empirical bounds on the method's robustness and shows that the context uniqueness analysis helps mitigate some misalignment effects. revision: yes
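
The analyses named in the first two responses (average cosine similarity between scenes, coverage of actions by the distilled space, standard deviation over runs, and a Wilcoxon signed-rank test) can be sketched as follows. This is a hedged illustration only: the embeddings, prototypes, AUC values, the 0.8 coverage cut-off, and the per-scene pairing of full versus ablated models are all made up here, and the signed-rank test uses a plain normal approximation without continuity or tie correction.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_cross_scene_similarity(scene_a, scene_b):
    """Average cosine similarity over all cross-scene embedding pairs."""
    sims = [cosine(a, b) for a in scene_a for b in scene_b]
    return sum(sims) / len(sims)


def coverage(scene, prototypes, min_sim=0.8):
    """Fraction of scene embeddings whose best prototype match reaches
    min_sim (an arbitrary illustrative cut-off)."""
    hits = sum(1 for e in scene if max(cosine(e, p) for p in prototypes) >= min_sim)
    return hits / len(scene)


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test, normal approximation.
    Zero differences are dropped; tied magnitudes share average ranks.
    Returns (W, p) with W the smaller of the two signed rank sums."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    p = math.erfc(abs((w - mean) / sd) / math.sqrt(2))
    return w, p


# Made-up 2-d action embeddings for two scenes and one prototype.
scene_a = [[1.0, 0.0], [0.9, 0.1]]
scene_b = [[1.0, 0.1], [0.0, 1.0]]
prototypes = [[1.0, 0.0]]
overlap = mean_cross_scene_similarity(scene_a, scene_b)
covered_b = coverage(scene_b, prototypes)  # 0.5: one of two embeddings is covered

# Made-up AUCs in percent: five seeds of one model, then per-scene pairs.
run_std = statistics.stdev([84.1, 84.5, 83.9, 84.3, 84.2])
full = [85, 80, 78, 90, 83, 76, 88, 81, 79, 86]
ablated = [80, 77, 79, 84, 79, 74, 81, 80, 75, 78]
w, p = wilcoxon_signed_rank(full, ablated)  # w == 1.5, p below 0.05 for this toy data
```

The signed-rank test is paired, so it needs the same scenes evaluated under both configurations; integer percent values are used in the toy data so that tied differences compare exactly.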

Circularity Check

0 steps flagged

No significant circularity; derivation chain remains independent of fitted inputs or self-citation reductions

full rationale

The paper's central claims rest on a language-guided semantic typicality module that projects skeleton snippets into an action semantic space and distills LLM priors on normal/abnormal behaviors, combined with a test-time context uniqueness module for scene-adaptive boundaries. These steps are presented as external knowledge transfer and local adaptation rather than quantities defined by construction from the target performance metrics or source-domain fits. No equations or descriptions reduce the zero-shot SOTA results on unseen scenes to a self-referential fit, renamed pattern, or load-bearing self-citation whose validity depends on the current paper. The method is self-contained against external benchmarks (LLM knowledge and cross-dataset evaluation), satisfying the criteria for an honest non-finding of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that skeleton representations remove appearance and background domain gaps and that LLM knowledge of action typicality can be effectively transferred to skeleton snippets. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Skeleton data eliminates domain disparities in background and human appearance, enabling generalization across scenes.
    Stated in the abstract as the inherent generalizable advantage of the skeleton-based approach.
  • domain assumption: LLM knowledge of typical normal and abnormal behaviors can be distilled into skeleton semantic space to improve zero-shot performance.
    Core premise of the language-guided semantic typicality modeling module described in the abstract.

pith-pipeline@v0.9.0 · 5764 in / 1471 out tokens · 31601 ms · 2026-05-18T17:11:02.188338+00:00 · methodology

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

  1. [1]

    Future frame prediction for anomaly detection–a new baseline,

    W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection–a new baseline,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6536–6545

  2. [2]

    Real-world anomaly detection in surveillance videos,

    W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6479–6488

  3. [3]

    Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles,

    G. Wang, Y. Wang, J. Qin, D. Zhang, X. Bao, and D. Huang, “Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles,” in European Conference on Computer Vision. Springer, 2022, pp. 494–511

  4. [4]

    A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction,

    Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li, “A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 588–13 597

  5. [5]

    Advancing Pre-trained Teacher: Towards Robust Feature Discrepancy for Anomaly Detection

    C. Tang, S. Zhou, Y. Li, Y. Dong, and L. Wang, “Advancing pre-trained teacher: towards robust feature discrepancy for anomaly detection,” arXiv preprint arXiv:2405.02068, 2024

  6. [6]

    Normalizing flows for human pose anomaly detection,

    O. Hirschorn and S. Avidan, “Normalizing flows for human pose anomaly detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13 545–13 554

  7. [7]

    Memory-augmented appearance-motion network for video anomaly detection,

    L. Wang, J. Tian, S. Zhou, H. Shi, and G. Hua, “Memory-augmented appearance-motion network for video anomaly detection,” Pattern Recognition, vol. 138, p. 109335, 2023

  8. [8]

    Look around for anomalies: weakly-supervised anomaly detection via context-motion relational learning,

    M. Cho, M. Kim, S. Hwang, C. Park, K. Lee, and S. Lee, “Look around for anomalies: weakly-supervised anomaly detection via context-motion relational learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 12 137–12 146

  9. [9]

    Abnormal ratios guided multi-phase self-training for weakly-supervised video anomaly detection,

    H. Shi, L. Wang, S. Zhou, G. Hua, and W. Tang, “Abnormal ratios guided multi-phase self-training for weakly-supervised video anomaly detection,” IEEE Transactions on Multimedia, 2023

  10. [10]

    Arc: A generalist graph anomaly detector with in-context learning,

    Y. Liu, S. Li, Y. Zheng, Q. Chen, C. Zhang, and S. Pan, “Arc: A generalist graph anomaly detector with in-context learning,” arXiv preprint arXiv:2405.16771, 2024

  11. [11]

    Winclip: Zero-/few-shot anomaly classification and segmentation,

    J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, “Winclip: Zero-/few-shot anomaly classification and segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 606–19 616

  12. [12]

    Filo: Zero-shot anomaly detection by fine-grained description and high-quality localization,

    Z. Gu, B. Zhu, G. Zhu, Y. Chen, H. Li, M. Tang, and J. Wang, “Filo: Zero-shot anomaly detection by fine-grained description and high-quality localization,” arXiv preprint arXiv:2404.13671, 2024

  13. [13]

    AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

    Y. Cao, J. Zhang, L. Frittoli, Y. Cheng, W. Shen, and G. Boracchi, “Adaclip: Adapting clip with hybrid learnable prompts for zero-shot anomaly detection,” arXiv preprint arXiv:2407.15795, 2024

  14. [14]

    AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection,

    Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, “AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection,” in The Twelfth International Conference on Learning Representations, 2024

  15. [15]

    Cross-domain video anomaly detection without target domain adaptation,

    A. Aich, K.-C. Peng, and A. K. Roy-Chowdhury, “Cross-domain video anomaly detection without target domain adaptation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2579–2591

  16. [16]

    Mulde: Multiscale log-density estimation via denoising score matching for video anomaly detection,

    J. Micorek, H. Possegger, D. Narnhofer, H. Bischof, and M. Kozinski, “Mulde: Multiscale log-density estimation via denoising score matching for video anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 868–18 877

  17. [17]

    Learning regularity in skeleton trajectories for anomaly detection in videos,

    R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, “Learning regularity in skeleton trajectories for anomaly detection in videos,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 996–12 004

  18. [18]

    Regularity learning via explicit distribution modeling for skeletal video anomaly detection,

    S. Yu, Z. Zhao, H. Fang, A. Deng, H. Su, D. Wang, W. Gan, C. Lu, and W. Wu, “Regularity learning via explicit distribution modeling for skeletal video anomaly detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023

  19. [19]

    Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection,

    A. Flaborea, L. Collorone, G. M. D. Di Melendugno, S. D’Arrigo, B. Prenkaj, and F. Galasso, “Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 318–10 329

  20. [20]

    Openpose: Realtime multi-person 2d pose estimation using part affinity fields,

    Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 1, pp. 172–186, 2019

  21. [21]

    Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time,

    H.-S. Fang, J. Li, H. Tang, C. Xu, H. Zhu, Y. Xiu, Y.-L. Li, and C. Lu, “Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 7157–7173, 2022

  22. [22]

    Ubnormal: New benchmark for supervised open-set video anomaly detection,

    A. Acsintoae, A. Florescu, M.-I. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, “Ubnormal: New benchmark for supervised open-set video anomaly detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20 143–20 153

  23. [23]

    A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation,

    C. Cao, Y. Lu, P. Wang, and Y. Zhang, “A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 20 392–20 401

  24. [24]

    Hierarchical semantic contrast for scene-aware video anomaly detection,

    S. Sun and X. Gong, “Hierarchical semantic contrast for scene-aware video anomaly detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22 846–22 856

  25. [25]

    Graph embedded pose clustering for anomaly detection,

    A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, “Graph embedded pose clustering for anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 539–10 547

  26. [26]

    Frequency-guided diffusion model with perturbation training for skeleton-based video anomaly detection,

    X. Tan, H. Wang, X. Geng, and L. Wang, “Frequency-guided diffusion model with perturbation training for skeleton-based video anomaly detection,” arXiv preprint arXiv:2412.03044, 2024

  27. [27]

    Glow: Generative flow with invertible 1x1 convolutions,

    D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” Advances in neural information processing systems, vol. 31, 2018

  28. [28]

    Da-flow: Dual attention normalizing flow for skeleton-based video anomaly detection,

    R. Wu, Y. Chen, J. Xiao, B. Li, J. Fan, F. Dufaux, C. Zhu, and Y. Liu, “Da-flow: Dual attention normalizing flow for skeleton-based video anomaly detection,” arXiv preprint arXiv:2406.02976, 2024

  29. [29]

    Zero-shot anomaly detection via batch normalization,

    A. Li, C. Qiu, M. Kloft, P. Smyth, M. Rudolph, and S. Mandt, “Zero-shot anomaly detection via batch normalization,” Advances in Neural Information Processing Systems, vol. 36, 2024

  30. [30]

    Zero-shot versus many-shot: Unsupervised texture anomaly detection,

    T. Aota, L. T. T. Tong, and T. Okatani, “Zero-shot versus many-shot: Unsupervised texture anomaly detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5564–5572

  31. [31]

    A zero-/fewshot anomaly classification and segmentation method for cvpr 2023 vand workshop challenge tracks 1&2: 1st place on zero-shot ad and 4th place on few-shot ad,

    X. Chen, Y. Han, and J. Zhang, “A zero-/fewshot anomaly classification and segmentation method for cvpr 2023 vand workshop challenge tracks 1&2: 1st place on zero-shot ad and 4th place on few-shot ad,” arXiv preprint arXiv:2305.17382, vol. 2, no. 4, 2023

  32. [32]

    Generalized out-of-distribution detection and beyond in vision language model era: A survey,

    A. Miyai, J. Yang, J. Zhang, Y. Ming, Y. Lin, Q. Yu, G. Irie, S. Joty, Y. Li, H. Li et al., “Generalized out-of-distribution detection and beyond in vision language model era: A survey,” arXiv preprint arXiv:2407.21794, 2024

  33. [33]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

  34. [34]

    Ada-vad: Domain adaptable video anomaly detection,

    D. Guo, Y. Fu, and S. Li, “Ada-vad: Domain adaptable video anomaly detection,” in Proceedings of the 2024 SIAM International Conference on Data Mining (SDM). SIAM, 2024, pp. 634–642

  35. [35]

    Harnessing large language models for training-free video anomaly detection,

    L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, “Harnessing large language models for training-free video anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 527–18 536

  36. [36]

    Prompt-guided zero-shot anomaly action recognition using pretrained deep skeleton features,

    F. Sato, R. Hachiuma, and T. Sekii, “Prompt-guided zero-shot anomaly action recognition using pretrained deep skeleton features,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 6471–6480

  37. [37]

    Quo vadis, action recognition? a new model and the kinetics dataset,

    J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308

  38. [38]

    Spatial temporal graph convolutional networks for skeleton-based action recognition,

    S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

  39. [39]

    Parallel attention interaction network for few-shot skeleton-based action recognition,

    X. Liu, S. Zhou, L. Wang, and G. Hua, “Parallel attention interaction network for few-shot skeleton-based action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1379–1388

  40. [40]

    Learning discriminative spatio-temporal representations for semi-supervised action recognition,

    Y. Wang, S. Zhou, K. Xia, and L. Wang, “Learning discriminative spatio-temporal representations for semi-supervised action recognition,” arXiv preprint arXiv:2404.16416, 2024

  41. [41]

    Actionclip: A new paradigm for video action recognition,

    M. Wang, J. Xing, and Y. Liu, “Actionclip: A new paradigm for video action recognition,” arXiv preprint arXiv:2109.08472, 2021

  42. [42]

    Generative action description prompts for skeleton-based action recognition,

    W. Xiang, C. Li, Y. Zhou, B. Wang, and L. Zhang, “Generative action description prompts for skeleton-based action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 276–10 285

  43. [43]

    Abnormal event detection at 150 fps in matlab,

    C. Lu, J. Shi, and J. Jia, “Abnormal event detection at 150 fps in matlab,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2720–2727

  44. [44]

    Anomaly detection and localization in crowded scenes,

    W. Li, V. Mahadevan, and N. Vasconcelos, “Anomaly detection and localization in crowded scenes,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 1, pp. 18–32, 2013

  45. [45]

    Imagebind: One embedding space to bind them all,

    R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, “Imagebind: One embedding space to bind them all,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 180–15 190

  46. [46]

    Channel-wise topology refinement graph convolution for skeleton-based action recognition,

    Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, “Channel-wise topology refinement graph convolution for skeleton-based action recognition,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 359–13 368

  47. [47]

    Explicit boundary guided semi-push-pull contrastive learning for supervised anomaly detection,

    X. Yao, R. Li, J. Zhang, J. Sun, and C. Zhang, “Explicit boundary guided semi-push-pull contrastive learning for supervised anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 490–24 499

  48. [48]

    Ssmtl++: Revisiting self-supervised multi-task learning for video anomaly detection,

    A. Barbalau, R. T. Ionescu, M.-I. Georgescu, J. Dueholm, B. Ramachandra, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, “Ssmtl++: Revisiting self-supervised multi-task learning for video anomaly detection,” Computer Vision and Image Understanding, vol. 229, p. 103656, 2023

  49. [49]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” 2023

  50. [50]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-VL technical report,” arXiv preprint arXiv:2502.13923, 2025

  51. [51]

    Tspo: Temporal sampling policy optimization for long-form video language understanding,

    C. Tang, Z. Han, H. Sun, S. Zhou, X. Zhang, X. Wei, Y. Yuan, J. Xu, and H. Sun, “Tspo: Temporal sampling policy optimization for long-form video language understanding,” arXiv preprint arXiv:2508.04369, 2025