pith. machine review for the scientific record.

arxiv: 2604.07772 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 Lean theorem links

ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords: open-world video anomaly detection · streaming video processing · training-free · anomaly localization · dynamic definitions · token merging · hybrid memory · OpenDef-Bench

The pith

ESOM processes streaming videos to detect and describe user-defined anomalies in real time without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ESOM as an efficient streaming model for open-world video anomaly detection that operates without training. It incorporates modules to normalize user prompts into structured anomaly definitions, merge redundant visual tokens across frames, maintain a hybrid memory for streaming inference, and convert textual outputs into frame-level scores. This design targets the inefficiency, lack of streaming support, and limited handling of dynamic anomaly definitions that constrain previous MLLM-based approaches. A new benchmark, OpenDef-Bench, is introduced to test performance across varying natural anomaly definitions on clean surveillance videos. If successful, this would allow practical, real-time deployment on standard hardware in applications such as intelligent surveillance and live-streaming moderation.

Core claim

ESOM is a training-free model for open-world video anomaly detection in streaming settings. It structures user prompts with Definition Normalization to reduce hallucinations, compresses tokens using Inter-frame-matched Intra-frame Token Merging, uses Hybrid Streaming Memory for causal inference, and applies Probabilistic Scoring to generate frame-level anomaly scores from interval outputs. The model achieves state-of-the-art results in localization, classification, and description on the new OpenDef-Bench.
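
The least standard step in this pipeline is the last one: turning interval-level text into frame-level scores. The paper's exact Probabilistic Scoring formulation is not reproduced above, so the sketch below shows one plausible conversion under our own assumptions: each interval carries a start frame, an end frame, and an anomaly probability (for example, derived from the model's yes/no token likelihoods), and overlapping intervals are max-pooled per frame. The names and the pooling choice are illustrative, not the paper's method.

```python
from dataclasses import dataclass

@dataclass
class IntervalOutput:
    start: int        # first frame index covered by the interval (inclusive)
    end: int          # last frame index covered by the interval (inclusive)
    p_anomaly: float  # anomaly probability attached to the MLLM's textual verdict

def interval_to_frame_scores(intervals: list[IntervalOutput], num_frames: int) -> list[float]:
    """Spread interval-level anomaly probabilities onto individual frames.

    Overlapping intervals are combined by max-pooling; ESOM's Probabilistic
    Scoring module may weight or normalize frames differently.
    """
    scores = [0.0] * num_frames
    for iv in intervals:
        for t in range(max(0, iv.start), min(num_frames - 1, iv.end) + 1):
            scores[t] = max(scores[t], iv.p_anomaly)
    return scores

# Example: two intervals flagged over a 10-frame clip.
print(interval_to_frame_scores([IntervalOutput(2, 4, 0.8), IntervalOutput(4, 6, 0.3)], 10))
# [0.0, 0.0, 0.8, 0.8, 0.8, 0.3, 0.3, 0.0, 0.0, 0.0]
```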

What carries the argument

ESOM's four core modules—Definition Normalization for prompt structuring, Inter-frame-matched Intra-frame Token Merging for token compression, Hybrid Streaming Memory for causal processing, and Probabilistic Scoring for score conversion—combined with the OpenDef-Bench evaluation dataset.
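
Of these, token compression is the main efficiency lever. The matching-and-merging procedure itself is not reproduced above, so the following is a simplified, ToMe-style stand-in [47] under our own assumptions: tokens in the current frame that closely match their nearest token in the previous frame are treated as redundant and dropped, keeping a configurable fraction. The function name, the retention ratio, and the drop-rather-than-average choice are illustrative, not ESOM's algorithm.

```python
import numpy as np

def drop_redundant_tokens(prev_tokens: np.ndarray,
                          curr_tokens: np.ndarray,
                          keep_ratio: float = 0.6) -> np.ndarray:
    """Keep only the current-frame tokens least similar to the previous frame.

    A simplified stand-in for inter-frame-matched token compression: each
    current token is matched to its most similar previous token by cosine
    similarity, and the most redundant ones are discarded.
    """
    prev = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    curr = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    best_sim = (curr @ prev.T).max(axis=1)        # similarity to closest previous-frame token
    n_keep = max(1, int(round(keep_ratio * len(curr_tokens))))
    keep_idx = np.argsort(best_sim)[:n_keep]      # least redundant tokens survive
    return curr_tokens[np.sort(keep_idx)]         # preserve original token order

# Example: 16 tokens of dimension 8 per frame; keep 60% of the current frame.
rng = np.random.default_rng(0)
prev, curr = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(drop_redundant_tokens(prev, curr).shape)    # (10, 8)
```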

If this is right

  • Allows real-time processing of streaming video on a single GPU.
  • Supports dynamic, user-specified anomaly definitions without retraining the model.
  • Enables causal inference suitable for live applications (a streaming-memory sketch follows this list).
  • Reduces hallucinations in anomaly descriptions through normalized prompts.
  • Provides a standardized benchmark for comparing open-world VAD methods under varying conditions.
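
The causal, streaming behavior hinges on how visual context is retained as frames arrive. The Hybrid Streaming Memory design is not detailed in the material above, so below is a minimal sketch of one common pattern it may resemble: a few permanently retained "sink" entries plus a bounded window of recent entries, in the spirit of attention-sink streaming [34]. The class, parameters, and eviction policy are illustrative assumptions, not ESOM's implementation.

```python
from collections import deque

class StreamingMemorySketch:
    """Illustrative hybrid memory: a small set of never-evicted 'sink' entries
    (long-term context) plus a FIFO window of recent entries (short-term
    context). ESOM's Hybrid Streaming Memory may store and evict differently."""

    def __init__(self, num_sinks: int = 4, window: int = 8):
        self.num_sinks = num_sinks
        self.sinks: list = []                       # long-term context, never evicted
        self.recent: deque = deque(maxlen=window)   # short-term context, FIFO eviction

    def update(self, entry) -> list:
        """Add one interval's compressed representation; return the current context."""
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)
        return self.sinks + list(self.recent)

# Example: stream 100 interval summaries through a memory bounded at 4 + 8 entries.
mem = StreamingMemorySketch(num_sinks=4, window=8)
for i in range(100):
    context = mem.update(f"interval_{i}")
print(len(context), context[:4], context[-2:])
# 12 ['interval_0', ..., 'interval_3'] ['interval_98', 'interval_99']
```

Whatever the actual design, the key property is that context size stays bounded regardless of stream length, which is what makes single-GPU real-time inference plausible.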

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The efficiency gains from token merging could make such systems viable for mobile or edge deployment in surveillance.
  • OpenDef-Bench might encourage development of models that handle even more diverse or ambiguous anomaly definitions.
  • The probabilistic scoring approach could be adapted to other video understanding tasks requiring frame-level outputs from language models.
  • Combining ESOM with additional memory mechanisms might further improve long-term streaming performance.

Load-bearing premise

The assumption that the four modules together enable effective open-world detection, hallucination reduction, and causal streaming inference without requiring any training or fine-tuning.

What would settle it

Running ESOM on new streaming videos with novel anomaly definitions and finding that its localization accuracy or description quality falls below that of trained or non-streaming MLLM baselines.
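
For concreteness, one way such a head-to-head could be scored is frame-level F1 of thresholded anomaly scores against ground-truth frame labels. The threshold and the choice of F1 (rather than, say, AUC) are our illustration; the paper's localization protocol may differ.

```python
def frame_f1(scores: list[float], labels: list[int], threshold: float = 0.5) -> float:
    """Frame-level F1 of binarized anomaly scores against 0/1 ground truth."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 5 frames, one false positive on the last frame.
print(frame_f1([0.1, 0.7, 0.9, 0.4, 0.8], [0, 1, 1, 0, 0]))  # 0.8
```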

Figures

Figures reproduced from arXiv: 2604.07772 by Jianqin Wu, Linlin Yang, Wenna Li, Xiaoyu Wu, Zihao Liu.

Figure 1: Motivation of ESOM and OpenDef-Bench. (a) ESOM addresses …
Figure 2: ESOM is a training-free streaming framework for open-world video anomaly detection under dynamic anomaly definitions. Given a raw user prompt, …
Figure 3: The construction pipeline and example samples of the proposed …
Figure 4: Statistics of OpenDef-Bench. With high video resolution, long video …
Figure 5: T-SNE visualization of text embeddings of anomaly definitions from …
Figure 6: Comparison of different token compression methods [47]–[49] under …
Original abstract

Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ESOM, a training-free model for open-world streaming video anomaly detection (OWVAD). It introduces four modules—Definition Normalization to structure prompts and reduce hallucination, Inter-frame-matched Intra-frame Token Merging to compress visual tokens, Hybrid Streaming Memory for causal inference, and Probabilistic Scoring to derive frame-level anomaly scores from textual outputs—along with the new OpenDef-Bench benchmark containing clean surveillance videos and diverse natural anomaly definitions. The central claim is that ESOM achieves real-time single-GPU inference and state-of-the-art performance on anomaly temporal localization, classification, and description generation.

Significance. If the performance and efficiency claims are substantiated, this work would represent a meaningful step toward practical OWVAD systems for surveillance and live-streaming moderation. The training-free design, streaming adaptation, and support for dynamic definitions address documented limitations of prior MLLM-based approaches, while the new benchmark could enable more rigorous evaluation of open-world generalization.

major comments (2)
  1. Abstract: The claim of achieving state-of-the-art performance in temporal localization, classification, and description generation is presented without reference to specific baselines, quantitative metrics, or error analysis, which prevents full assessment of whether the four modules deliver the asserted gains over existing methods.
  2. The weakest assumption—that Definition Normalization, Inter-frame-matched Intra-frame Token Merging, Hybrid Streaming Memory, and Probabilistic Scoring together enable effective open-world detection, hallucination reduction, and causal streaming inference without any training—requires explicit ablation or component-wise results to confirm it is load-bearing for the SOTA claim.
minor comments (1)
  1. The commitment to release code and the OpenDef-Bench benchmark is noted positively; ensure the release includes all prompts, preprocessing details, and evaluation scripts to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to better substantiate our claims. We address each major comment below and indicate the corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: Abstract: The claim of achieving state-of-the-art performance in temporal localization, classification, and description generation is presented without reference to specific baselines, quantitative metrics, or error analysis, which prevents full assessment of whether the four modules deliver the asserted gains over existing methods.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the SOTA claim more readily. The experiments section already contains detailed quantitative comparisons against relevant baselines for temporal localization (F1-score), classification accuracy, and description generation quality, together with supporting error analysis. In the revised manuscript we have updated the abstract to include brief references to these key metrics and baselines while directing readers to the full tables and analysis in the main text. This change strengthens the abstract without altering the reported experimental outcomes. revision: yes

  2. Referee: The weakest assumption—that Definition Normalization, Inter-frame-matched Intra-frame Token Merging, Hybrid Streaming Memory, and Probabilistic Scoring together enable effective open-world detection, hallucination reduction, and causal streaming inference without any training—requires explicit ablation or component-wise results to confirm it is load-bearing for the SOTA claim.

    Authors: The referee correctly notes that component-wise validation would more rigorously demonstrate the contribution of each module to the overall performance. While the original manuscript reports end-to-end results for the complete ESOM system, we acknowledge the value of explicit ablations. We have added a dedicated ablation study in the revised version that isolates the effect of removing or altering each module individually. The new results quantify impacts on hallucination rates, inference latency, and detection metrics, confirming that the four modules are collectively load-bearing for the training-free open-world performance gains. revision: yes
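
As a concrete illustration of what such a component-wise study involves, a minimal harness can disable one module at a time and record the metrics of interest. The evaluate callable, metric names, and dummy numbers below are placeholders standing in for real inference runs; they are not results from the paper.

```python
MODULES = ("definition_normalization", "token_merging", "streaming_memory", "probabilistic_scoring")

def run_ablation(evaluate):
    """evaluate(disabled: set[str]) -> dict of metrics, supplied by the experimenter."""
    results = {"full": evaluate(set())}
    for module in MODULES:
        results[f"w/o {module}"] = evaluate({module})
    return results

# Dummy evaluator so the sketch runs end to end; replace with real inference and scoring.
def dummy_evaluate(disabled):
    return {"localization_f1": round(0.70 - 0.05 * len(disabled), 2),
            "latency_ms": 45 + 10 * len(disabled)}

for setting, metrics in run_ablation(dummy_evaluate).items():
    print(setting, metrics)
```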

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents ESOM as a training-free composition of four explicitly described modules (Definition Normalization to structure prompts and reduce hallucination, Inter-frame-matched Intra-frame Token Merging for token compression, Hybrid Streaming Memory for causal streaming inference, and Probabilistic Scoring to convert interval outputs to frame-level scores) plus the newly introduced OpenDef-Bench benchmark. No equations, fitted parameters, or derivations appear in the abstract or module descriptions that reduce any claimed output to an input by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The performance claims rest on experimental results on the external benchmark rather than internal re-labeling or self-referential fitting, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Limited information from abstract only; no specific free parameters or axioms detailed beyond standard use of MLLMs and computer vision techniques.

pith-pipeline@v0.9.0 · 5534 in / 1158 out tokens · 43589 ms · 2026-05-10T17:51:34.577081+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Breath1024.lean · period8 · echoes

    Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Paper passage: "a GoF structure with size 8 is adopted, where the first frame is treated as an I-frame, the last frame as a P-frame, and the remaining frames as B-frames. Correspondingly, the token retention ratios for B-frames and P-frames are set to γ_B = 0.2 and γ_P = 0.6" (A token-budget sketch based on this passage follows the list.)

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Unclear: relation between the paper passage and the cited Recognition theorem.

    Paper passage: "The DN module converts user prompt into a structured anomaly definition table to reduce hallucination... Probabilistic Scoring (PS) module that converts interval-level textual outputs into frame-level anomaly scores"
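
Returning to the first passage above: with a group of 8 frames, one fully retained I-frame, one P-frame kept at γ_P = 0.6, and six B-frames kept at γ_B = 0.2, the per-group token budget follows directly. The sketch below assumes 196 visual tokens per frame (a typical ViT patch count and our assumption, not a figure from the paper); the frame roles and ratios are taken from the quoted passage.

```python
def gof_token_budget(tokens_per_frame: int, gof_size: int = 8,
                     gamma_b: float = 0.2, gamma_p: float = 0.6) -> list[int]:
    """Tokens retained per frame in one group-of-frames (GoF).

    First frame: I-frame, fully retained. Last frame: P-frame, gamma_p kept.
    Remaining frames: B-frames, gamma_b kept. Roles and ratios follow the
    quoted passage; the budget arithmetic is our own illustration.
    """
    budgets = []
    for i in range(gof_size):
        ratio = 1.0 if i == 0 else (gamma_p if i == gof_size - 1 else gamma_b)
        budgets.append(int(round(ratio * tokens_per_frame)))
    return budgets

budgets = gof_token_budget(tokens_per_frame=196)
print(budgets, "total:", sum(budgets), "of", 196 * 8)
# [196, 39, 39, 39, 39, 39, 39, 118] total: 548 of 1568  (roughly 35% of the original tokens)
```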

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 12 canonical work pages · 4 internal anchors

  [1] P. Wu, C. Pan, Y. Yan, G. Pang, P. Wang, and Y. Zhang, "Deep learning for video anomaly detection: A review," arXiv preprint arXiv:2409.05383, 2024.
  [2] P. Wu, X. Zhou, G. Pang, Y. Sun, J. Liu, P. Wang, and Y. Zhang, "Open-vocabulary video anomaly detection," in CVPR, 2024, pp. 18297–18307.
  [3] F. Li, W. Liu, J. Chen, R. Zhang, Y. Wang, X. Zhong, and Z. Wang, "Anomize: Better open vocabulary video anomaly detection," in CVPR, 2025, pp. 29203–29212.
  [4] C. Huang, W. Huang, Q. Jiang, W. Wang, J. Wen, and B. Zhang, "Multimodal evidential learning for open-world weakly-supervised video anomaly detection," TMM, 2025.
  [5] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, "Harnessing large language models for training-free video anomaly detection," in CVPR, 2024, pp. 18527–18536.
  [6] H. Zhang, X. Xu, X. Wang, J. Zuo, X. Huang, C. Gao, S. Zhang, L. Yu, and N. Sang, "Holmes-VAU: Towards long-term video anomaly understanding at any granularity," in CVPR, 2025, pp. 13843–13853.
  [7] Z. Liu, X. Wu, J. Wu, X. Wang, and L. Yang, "Language-guided open-world video anomaly detection under weak supervision," arXiv preprint arXiv:2503.13160, 2025.
  [8] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, "VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection," in AAAI, 2024, pp. 6074–6082.
  [9] H. Zhang, X. Wang, X. Xu, X. Huang, C. Gao, Y. Wang, S. Zhang, and N. Sang, "GlanceVAD: Exploring glance supervision for label-efficient video anomaly detection," in ICME, 2025, pp. 1–6.
  [10] P. Chen, S. Du, X. Zhao, J. Hu, J. Li, and T. Li, "Dctformer: A dual-branch transformer with cloze tests for video anomaly detection," TMM, 2025.
  [11] M. Ye, W. Liu, and P. He, "VERA: Explainable video anomaly detection via verbalized learning of vision-language models," in CVPR, 2025, pp. 8679–8688.
  [12] S. Smeureanu, R. T. Ionescu, M. Popescu, and B. Alexe, "Deep appearance features for abnormal behavior detection in video," in Int. Conf. Image Anal. Proc., 2017, pp. 779–789.
  [13] P. Wu, J. Liu, and F. Shen, "A deep one-class neural network for anomalous event detection in complex scenes," TNNLS, vol. 31, no. 7, pp. 2609–2622, 2020.
  [14] J. Tang, H. Lu, R. Wu, X. Xu, K. Ma, C. Fang, B. Guo, J. Lu, Q. Chen, and Y. Chen, "Hawk: Learning to understand open-world video anomalies," in NeurIPS, vol. 37, 2024, pp. 139751–139785.
  [15] C. Huang, B. Wang, J. Wen, C. Liu, W. Wang, L. Shen, and X. Cao, "VAD-R1: Towards video anomaly reasoning via perception-to-cognition chain-of-thought," arXiv preprint arXiv:2505.19877, 2025.
  [16] L. Zhu, Q. Chen, X. Shen, and X. Cun, "VAU-R1: Advancing video anomaly understanding via reinforcement fine-tuning," arXiv preprint arXiv:2505.23504, 2025.
  [17] Z. Yang, C. Gao, and M. Z. Shou, "Panda: Towards generalist video anomaly detection via agentic AI engineer," arXiv preprint arXiv:2509.26386, 2025.
  [18] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, "RoFormer: Enhanced transformer with rotary position embedding," Neurocomputing, vol. 568, p. 127063, 2024.
  [19] Y. Zhu, W. Bao, and Q. Yu, "Towards open set video anomaly detection," in ECCV, 2022, pp. 395–412.
  [20] H. Huang, Z. Hu, D. Feng, C. Chen, D. Li, H. Liu, and L. Duan, "Enabling real-world supervised video anomaly detection: New open-set benchmark and new framework," TMM, 2026.
  [21] A. Acsintoae, A. Florescu, M. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, "UBnormal: New benchmark for supervised open-set video anomaly detection," in CVPR, 2022, pp. 20111–20121.
  [22] Z. Wang, X. Gu, H. Yan, and X. Gu, "Domain generalization for video anomaly detection considering diverse anomaly types," Signal, Image and Video Processing, vol. 18, no. 4, pp. 3691–3704, 2024.
  [23] Y. Jain, A. Dabouei, and M. Xu, "Cross-domain learning for video anomaly detection with limited supervision," in ECCV, 2024, pp. 468–484.
  [24] A. Aich, K.-C. Peng, and A. K. Roy-Chowdhury, "Cross-domain video anomaly detection without target domain adaptation," in WACV, 2023, pp. 2579–2591.
  [25] L. Zhou, Y. Gao, M. Zhang, P. Wu, P. Wang, and Y. Zhang, "Human-centric behavior description in videos: New benchmark and model," TMM, vol. 26, pp. 10867–10878, 2024.
  [26] Q. Bao, F. Liu, L. Jiao, Y. Liu, S. Li, L. Li, X. Liu, X. Wang, and B. Chen, "Anomaly-led prompting learning caption generating model and benchmark," TMM, 2025.
  [27] H. Du, S. Zhang, B. Xie, G. Nan, J. Zhang, J. Xu, H. Liu, S. Leng, J. Liu, H. Fan, D. Huang, J. Feng, L. Chen, C. Zhang, X. Li, H. Zhang, J. Chen, Q. Cui, and X. Tao, "Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly," in CVPR, 2024, pp. 18793–18803.
  [28] T. Yuan, X. Zhang, K. Liu, B. Liu, C. Chen, J. Jin, and Z. Jiao, "Towards surveillance video-and-language understanding: New dataset, baselines, and challenges," in CVPR, 2024, pp. 22052–22061.
  [29] X. Wang, X. Wu, and Z. Liu, "Enhancing video anomaly understanding via multi-task instruction tuning," IEEE Signal Processing Letters, vol. 32, pp. 4359–4363, 2025.
  [30] M. Cho, T. Kim, M. Shim, D. Wee, and S. Lee, "Towards multi-domain learning for generalizable video anomaly detection," NeurIPS, vol. 37, pp. 50256–50284, 2024.
  [31] X. Tang, J. Qiu, L. Xie, Y. Tian, J. Jiao, and Q. Ye, "Adaptive keyframe sampling for long video understanding," in CVPR, 2025, pp. 29118–29128.
  [32] T. Fu, T. Liu, Q. Han, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang, "FrameFusion: Combining similarity and importance for video token reduction on large vision language models," in ICCV, 2025, pp. 22654–22663.
  [33] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," TCSVT, vol. 13, no. 7, pp. 560–576, 2003.
  [34] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, "Efficient streaming language models with attention sinks," arXiv preprint, 2023.
  [35] Z. Liu, X. Wu, W. Li, L. Yang, and S. Wang, "Rethinking metrics and benchmarks of video anomaly detection," arXiv preprint arXiv:2505.19022, 2025.
  [36] X. Wang and D. Zhou, "Chain-of-thought reasoning without prompting," NeurIPS, vol. 37, pp. 66383–66409, 2024.
  [37] J. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, "Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection," in CVPR, 2019, pp. 1237–1246.
  [38] C. Cao, Y. Lu, P. Wang, and Y. Zhang, "A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation," in CVPR, 2023, pp. 20392–20401.
  [39] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis et al., "A large-scale benchmark dataset for event recognition in surveillance video," in CVPR, 2011, pp. 3153–3160.
  [40] K. Corona, K. Osterdahl, R. Collins, and A. Hoogs, "MEVA: A large-scale multiview, multimodal video dataset for activity detection," in WACV, 2021, pp. 1060–1068.
  [41] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in CVPR, 2018, pp. 6479–6488.
  [42] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, "Not only look, but also listen: Learning multimodal violence detection under weak supervision," in ECCV, 2020, pp. 322–339.
  [43] L. Zhu, L. Wang, A. Raj, T. Gedeon, and C. Chen, "Advancing video anomaly detection: A concise review and a new dataset," in NeurIPS, vol. 37, 2024, pp. 89943–89977.
  [44] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
  [45] Y. Pu, X. Wu, L. Yang, and S. Wang, "Learning prompt-enhanced context features for weakly-supervised video anomaly detection," TIP, vol. 33, pp. 4923–4936, 2024.
  [46] Z. Sun, Z. Peng, Y. Ma, Y. Chen, Z. Zhou, Z. Zhou, G. Zhang, Y. Zhang, Y. Zhou, Q. Lu et al., "Streamavatar: Streaming diffusion models for real-time interactive human avatars," arXiv preprint arXiv:2512.22065, 2025.
  [47] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, "Token merging: Your ViT but faster," arXiv preprint arXiv:2210.09461, 2022.
  [48] K. Tao, C. Qin, H. You, Y. Sui, and H. Wang, "DyCoke: Dynamic compression of tokens for fast video large language models," in CVPR, 2025, pp. 18992–19001.
  [49] B. Yang, B. Wen, B. Ding, C. Liu, C. Chu, C. Song, C. Rao, C. Yi, D. Li, D. Zang et al., "Kwai Keye-VL 1.5 technical report," arXiv preprint arXiv:2509.01563, 2025.
  [50] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.
  [51] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge et al., "Qwen3-VL technical report," arXiv preprint arXiv:2511.21631, 2025.