pith. machine review for the scientific record.

arxiv: 2604.07772 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 Lean theorem links

ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords: open-world video anomaly detection · streaming video processing · training-free · anomaly localization · dynamic definitions · token merging · hybrid memory · OpenDef-Bench

The pith

ESOM processes streaming videos to detect and describe user-defined anomalies in real time without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ESOM as an efficient streaming model for open-world video anomaly detection that operates without training. It incorporates modules to normalize user prompts into structured anomaly definitions, merge redundant visual tokens across frames, maintain a hybrid memory for streaming inference, and convert textual outputs into frame-level scores. This design targets the inefficiency, lack of streaming support, and limited handling of dynamic anomaly definitions that constrain previous MLLM-based approaches. A new benchmark, OpenDef-Bench, is introduced to test performance across varying natural anomaly definitions on clean surveillance videos. If successful, this would allow practical, real-time deployment on standard hardware in applications such as intelligent surveillance and live-streaming moderation.

Core claim

ESOM is a training-free model for open-world video anomaly detection in streaming settings. It structures user prompts with Definition Normalization to reduce hallucinations, compresses tokens using Inter-frame-matched Intra-frame Token Merging, uses Hybrid Streaming Memory for causal inference, and applies Probabilistic Scoring to generate frame-level anomaly scores from interval outputs. The model achieves state-of-the-art results in localization, classification, and description on the new OpenDef-Bench.
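
The least standard step in this pipeline is the last one: turning interval-level text into frame-level scores. The paper's exact Probabilistic Scoring formulation is not reproduced above, so the sketch below shows one plausible conversion under our own assumptions: each interval carries a start frame, an end frame, and an anomaly probability (for example, derived from the model's yes/no token likelihoods), and overlapping intervals are max-pooled per frame. The names and the pooling choice are illustrative, not the paper's method.

```python
from dataclasses import dataclass

@dataclass
class IntervalOutput:
    start: int        # first frame index covered by the interval (inclusive)
    end: int          # last frame index covered by the interval (inclusive)
    p_anomaly: float  # anomaly probability attached to the MLLM's textual verdict

def interval_to_frame_scores(intervals: list[IntervalOutput], num_frames: int) -> list[float]:
    """Spread interval-level anomaly probabilities onto individual frames.

    Overlapping intervals are combined by max-pooling; ESOM's Probabilistic
    Scoring module may weight or normalize frames differently.
    """
    scores = [0.0] * num_frames
    for iv in intervals:
        for t in range(max(0, iv.start), min(num_frames - 1, iv.end) + 1):
            scores[t] = max(scores[t], iv.p_anomaly)
    return scores

# Example: two intervals flagged over a 10-frame clip.
print(interval_to_frame_scores([IntervalOutput(2, 4, 0.8), IntervalOutput(4, 6, 0.3)], 10))
# [0.0, 0.0, 0.8, 0.8, 0.8, 0.3, 0.3, 0.0, 0.0, 0.0]
```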

What carries the argument

ESOM's four core modules—Definition Normalization for prompt structuring, Inter-frame-matched Intra-frame Token Merging for token compression, Hybrid Streaming Memory for causal processing, and Probabilistic Scoring for score conversion—combined with the OpenDef-Bench evaluation dataset.
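
Of these, token compression is the main efficiency lever. The matching-and-merging procedure itself is not reproduced above, so the following is a simplified, ToMe-style stand-in [47] under our own assumptions: tokens in the current frame that closely match their nearest token in the previous frame are treated as redundant and dropped, keeping a configurable fraction. The function name, the retention ratio, and the drop-rather-than-average choice are illustrative, not ESOM's algorithm.

```python
import numpy as np

def drop_redundant_tokens(prev_tokens: np.ndarray,
                          curr_tokens: np.ndarray,
                          keep_ratio: float = 0.6) -> np.ndarray:
    """Keep only the current-frame tokens least similar to the previous frame.

    A simplified stand-in for inter-frame-matched token compression: each
    current token is matched to its most similar previous token by cosine
    similarity, and the most redundant ones are discarded.
    """
    prev = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    curr = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    best_sim = (curr @ prev.T).max(axis=1)        # similarity to closest previous-frame token
    n_keep = max(1, int(round(keep_ratio * len(curr_tokens))))
    keep_idx = np.argsort(best_sim)[:n_keep]      # least redundant tokens survive
    return curr_tokens[np.sort(keep_idx)]         # preserve original token order

# Example: 16 tokens of dimension 8 per frame; keep 60% of the current frame.
rng = np.random.default_rng(0)
prev, curr = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(drop_redundant_tokens(prev, curr).shape)    # (10, 8)
```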

If this is right

  • Allows real-time processing of streaming video on a single GPU.
  • Supports dynamic, user-specified anomaly definitions without retraining the model.
  • Enables causal inference suitable for live applications (a streaming-memory sketch follows this list).
  • Reduces hallucinations in anomaly descriptions through normalized prompts.
  • Provides a standardized benchmark for comparing open-world VAD methods under varying conditions.
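
The causal, streaming behavior hinges on how visual context is retained as frames arrive. The Hybrid Streaming Memory design is not detailed in the material above, so below is a minimal sketch of one common pattern it may resemble: a few permanently retained "sink" entries plus a bounded window of recent entries, in the spirit of attention-sink streaming [34]. The class, parameters, and eviction policy are illustrative assumptions, not ESOM's implementation.

```python
from collections import deque

class StreamingMemorySketch:
    """Illustrative hybrid memory: a small set of never-evicted 'sink' entries
    (long-term context) plus a FIFO window of recent entries (short-term
    context). ESOM's Hybrid Streaming Memory may store and evict differently."""

    def __init__(self, num_sinks: int = 4, window: int = 8):
        self.num_sinks = num_sinks
        self.sinks: list = []                       # long-term context, never evicted
        self.recent: deque = deque(maxlen=window)   # short-term context, FIFO eviction

    def update(self, entry) -> list:
        """Add one interval's compressed representation; return the current context."""
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)
        return self.sinks + list(self.recent)

# Example: stream 100 interval summaries through a memory bounded at 4 + 8 entries.
mem = StreamingMemorySketch(num_sinks=4, window=8)
for i in range(100):
    context = mem.update(f"interval_{i}")
print(len(context), context[:4], context[-2:])
# 12 ['interval_0', ..., 'interval_3'] ['interval_98', 'interval_99']
```

Whatever the actual design, the key property is that context size stays bounded regardless of stream length, which is what makes single-GPU real-time inference plausible.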

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The efficiency gains from token merging could make such systems viable for mobile or edge deployment in surveillance.
  • OpenDef-Bench might encourage development of models that handle even more diverse or ambiguous anomaly definitions.
  • The probabilistic scoring approach could be adapted to other video understanding tasks requiring frame-level outputs from language models.
  • Combining ESOM with additional memory mechanisms might further improve long-term streaming performance.

Load-bearing premise

The assumption that the four modules together enable effective open-world detection, hallucination reduction, and causal streaming inference without requiring any training or fine-tuning.

What would settle it

Running ESOM on new streaming videos with novel anomaly definitions and finding that its localization accuracy or description quality falls below that of trained or non-streaming MLLM baselines.
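
For concreteness, one way such a head-to-head could be scored is frame-level F1 of thresholded anomaly scores against ground-truth frame labels. The threshold and the choice of F1 (rather than, say, AUC) are our illustration; the paper's localization protocol may differ.

```python
def frame_f1(scores: list[float], labels: list[int], threshold: float = 0.5) -> float:
    """Frame-level F1 of binarized anomaly scores against 0/1 ground truth."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 5 frames, one false positive on the last frame.
print(frame_f1([0.1, 0.7, 0.9, 0.4, 0.8], [0, 1, 1, 0, 0]))  # 0.8
```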

Figures

Figures reproduced from arXiv: 2604.07772 by Jianqin Wu, Linlin Yang, Wenna Li, Xiaoyu Wu, Zihao Liu.

Figure 1: Motivation of ESOM and OpenDef-Bench. (a) ESOM addresses …
Figure 2: ESOM is a training-free streaming framework for open-world video anomaly detection under dynamic anomaly definitions. Given a raw user prompt, …
Figure 3: The construction pipeline and example samples of the proposed …
Figure 4: Statistics of OpenDef-Bench. With high video resolution, long video …
Figure 5: T-SNE visualization of text embeddings of anomaly definitions from …
Figure 6: Comparison of different token compression methods [47]–[49] under …
Original abstract

Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ESOM, a training-free model for open-world streaming video anomaly detection (OWVAD). It introduces four modules—Definition Normalization to structure prompts and reduce hallucination, Inter-frame-matched Intra-frame Token Merging to compress visual tokens, Hybrid Streaming Memory for causal inference, and Probabilistic Scoring to derive frame-level anomaly scores from textual outputs—along with the new OpenDef-Bench benchmark containing clean surveillance videos and diverse natural anomaly definitions. The central claim is that ESOM achieves real-time single-GPU inference and state-of-the-art performance on anomaly temporal localization, classification, and description generation.

Significance. If the performance and efficiency claims are substantiated, this work would represent a meaningful step toward practical OWVAD systems for surveillance and live-streaming moderation. The training-free design, streaming adaptation, and support for dynamic definitions address documented limitations of prior MLLM-based approaches, while the new benchmark could enable more rigorous evaluation of open-world generalization.

major comments (2)
  1. Abstract: The claim of achieving state-of-the-art performance in temporal localization, classification, and description generation is presented without reference to specific baselines, quantitative metrics, or error analysis, which prevents full assessment of whether the four modules deliver the asserted gains over existing methods.
  2. The weakest assumption—that Definition Normalization, Inter-frame-matched Intra-frame Token Merging, Hybrid Streaming Memory, and Probabilistic Scoring together enable effective open-world detection, hallucination reduction, and causal streaming inference without any training—requires explicit ablation or component-wise results to confirm it is load-bearing for the SOTA claim.
minor comments (1)
  1. The commitment to release code and the OpenDef-Bench benchmark is noted positively; ensure the release includes all prompts, preprocessing details, and evaluation scripts to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to better substantiate our claims. We address each major comment below and indicate the corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: Abstract: The claim of achieving state-of-the-art performance in temporal localization, classification, and description generation is presented without reference to specific baselines, quantitative metrics, or error analysis, which prevents full assessment of whether the four modules deliver the asserted gains over existing methods.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the SOTA claim more readily. The experiments section already contains detailed quantitative comparisons against relevant baselines for temporal localization (F1-score), classification accuracy, and description generation quality, together with supporting error analysis. In the revised manuscript we have updated the abstract to include brief references to these key metrics and baselines while directing readers to the full tables and analysis in the main text. This change strengthens the abstract without altering the reported experimental outcomes. revision: yes

  2. Referee: The weakest assumption—that Definition Normalization, Inter-frame-matched Intra-frame Token Merging, Hybrid Streaming Memory, and Probabilistic Scoring together enable effective open-world detection, hallucination reduction, and causal streaming inference without any training—requires explicit ablation or component-wise results to confirm it is load-bearing for the SOTA claim.

    Authors: The referee correctly notes that component-wise validation would more rigorously demonstrate the contribution of each module to the overall performance. While the original manuscript reports end-to-end results for the complete ESOM system, we acknowledge the value of explicit ablations. We have added a dedicated ablation study in the revised version that isolates the effect of removing or altering each module individually. The new results quantify impacts on hallucination rates, inference latency, and detection metrics, confirming that the four modules are collectively load-bearing for the training-free open-world performance gains. revision: yes
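
As a concrete illustration of what such a component-wise study involves, a minimal harness can disable one module at a time and record the metrics of interest. The evaluate callable, metric names, and dummy numbers below are placeholders standing in for real inference runs; they are not results from the paper.

```python
MODULES = ("definition_normalization", "token_merging", "streaming_memory", "probabilistic_scoring")

def run_ablation(evaluate):
    """evaluate(disabled: set[str]) -> dict of metrics, supplied by the experimenter."""
    results = {"full": evaluate(set())}
    for module in MODULES:
        results[f"w/o {module}"] = evaluate({module})
    return results

# Dummy evaluator so the sketch runs end to end; replace with real inference and scoring.
def dummy_evaluate(disabled):
    return {"localization_f1": round(0.70 - 0.05 * len(disabled), 2),
            "latency_ms": 45 + 10 * len(disabled)}

for setting, metrics in run_ablation(dummy_evaluate).items():
    print(setting, metrics)
```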

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents ESOM as a training-free composition of four explicitly described modules (Definition Normalization to structure prompts and reduce hallucination, Inter-frame-matched Intra-frame Token Merging for token compression, Hybrid Streaming Memory for causal streaming inference, and Probabilistic Scoring to convert interval outputs to frame-level scores) plus the newly introduced OpenDef-Bench benchmark. No equations, fitted parameters, or derivations appear in the abstract or module descriptions that reduce any claimed output to an input by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The performance claims rest on experimental results on the external benchmark rather than internal re-labeling or self-referential fitting, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Limited information from abstract only; no specific free parameters or axioms detailed beyond standard use of MLLMs and computer vision techniques.

pith-pipeline@v0.9.0 · 5534 in / 1158 out tokens · 43589 ms · 2026-05-10T17:51:34.577081+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Breath1024.lean · period8 · echoes

    Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Paper passage: "a GoF structure with size 8 is adopted, where the first frame is treated as an I-frame, the last frame as a P-frame, and the remaining frames as B-frames. Correspondingly, the token retention ratios for B-frames and P-frames are set to γ_B = 0.2 and γ_P = 0.6" (A token-budget sketch based on this passage follows the list.)

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Unclear: relation between the paper passage and the cited Recognition theorem.

    Paper passage: "The DN module converts user prompt into a structured anomaly definition table to reduce hallucination... Probabilistic Scoring (PS) module that converts interval-level textual outputs into frame-level anomaly scores"
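
Returning to the first passage above: with a group of 8 frames, one fully retained I-frame, one P-frame kept at γ_P = 0.6, and six B-frames kept at γ_B = 0.2, the per-group token budget follows directly. The sketch below assumes 196 visual tokens per frame (a typical ViT patch count and our assumption, not a figure from the paper); the frame roles and ratios are taken from the quoted passage.

```python
def gof_token_budget(tokens_per_frame: int, gof_size: int = 8,
                     gamma_b: float = 0.2, gamma_p: float = 0.6) -> list[int]:
    """Tokens retained per frame in one group-of-frames (GoF).

    First frame: I-frame, fully retained. Last frame: P-frame, gamma_p kept.
    Remaining frames: B-frames, gamma_b kept. Roles and ratios follow the
    quoted passage; the budget arithmetic is our own illustration.
    """
    budgets = []
    for i in range(gof_size):
        ratio = 1.0 if i == 0 else (gamma_p if i == gof_size - 1 else gamma_b)
        budgets.append(int(round(ratio * tokens_per_frame)))
    return budgets

budgets = gof_token_budget(tokens_per_frame=196)
print(budgets, "total:", sum(budgets), "of", 196 * 8)
# [196, 39, 39, 39, 39, 39, 39, 118] total: 548 of 1568  (roughly 35% of the original tokens)
```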

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 12 canonical work pages · 4 internal anchors

  [1] P. Wu, C. Pan, Y. Yan, G. Pang, P. Wang, and Y. Zhang, "Deep learning for video anomaly detection: A review," arXiv preprint arXiv:2409.05383, 2024.
  [2] P. Wu, X. Zhou, G. Pang, Y. Sun, J. Liu, P. Wang, and Y. Zhang, "Open-vocabulary video anomaly detection," in CVPR, 2024, pp. 18297–18307.
  [3] F. Li, W. Liu, J. Chen, R. Zhang, Y. Wang, X. Zhong, and Z. Wang, "Anomize: Better open vocabulary video anomaly detection," in CVPR, 2025, pp. 29203–29212.
  [4] C. Huang, W. Huang, Q. Jiang, W. Wang, J. Wen, and B. Zhang, "Multimodal evidential learning for open-world weakly-supervised video anomaly detection," TMM, 2025.
  [5] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, "Harnessing large language models for training-free video anomaly detection," in CVPR, 2024, pp. 18527–18536.
  [6] H. Zhang, X. Xu, X. Wang, J. Zuo, X. Huang, C. Gao, S. Zhang, L. Yu, and N. Sang, "Holmes-VAU: Towards long-term video anomaly understanding at any granularity," in CVPR, 2025, pp. 13843–13853.
  [7] Z. Liu, X. Wu, J. Wu, X. Wang, and L. Yang, "Language-guided open-world video anomaly detection under weak supervision," arXiv preprint arXiv:2503.13160, 2025.
  [8] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, "VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection," in AAAI, 2024, pp. 6074–6082.
  [9] H. Zhang, X. Wang, X. Xu, X. Huang, C. Gao, Y. Wang, S. Zhang, and N. Sang, "GlanceVAD: Exploring glance supervision for label-efficient video anomaly detection," in ICME, 2025, pp. 1–6.
  [10] P. Chen, S. Du, X. Zhao, J. Hu, J. Li, and T. Li, "Dctformer: A dual-branch transformer with cloze tests for video anomaly detection," TMM, 2025.
  [11] M. Ye, W. Liu, and P. He, "VERA: Explainable video anomaly detection via verbalized learning of vision-language models," in CVPR, 2025, pp. 8679–8688.
  [12] S. Smeureanu, R. T. Ionescu, M. Popescu, and B. Alexe, "Deep appearance features for abnormal behavior detection in video," in Int. Conf. Image Anal. Proc., 2017, pp. 779–789.
  [13] P. Wu, J. Liu, and F. Shen, "A deep one-class neural network for anomalous event detection in complex scenes," TNNLS, vol. 31, no. 7, pp. 2609–2622, 2020.
  [14] J. Tang, H. Lu, R. Wu, X. Xu, K. Ma, C. Fang, B. Guo, J. Lu, Q. Chen, and Y. Chen, "Hawk: Learning to understand open-world video anomalies," in NeurIPS, vol. 37, 2024, pp. 139751–139785.
  [15] C. Huang, B. Wang, J. Wen, C. Liu, W. Wang, L. Shen, and X. Cao, "VAD-R1: Towards video anomaly reasoning via perception-to-cognition chain-of-thought," arXiv preprint arXiv:2505.19877, 2025.
  [16] L. Zhu, Q. Chen, X. Shen, and X. Cun, "VAU-R1: Advancing video anomaly understanding via reinforcement fine-tuning," arXiv preprint arXiv:2505.23504, 2025.
  [17] Z. Yang, C. Gao, and M. Z. Shou, "Panda: Towards generalist video anomaly detection via agentic AI engineer," arXiv preprint arXiv:2509.26386, 2025.
  [18] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, "RoFormer: Enhanced transformer with rotary position embedding," Neurocomputing, vol. 568, p. 127063, 2024.
  [19] Y. Zhu, W. Bao, and Q. Yu, "Towards open set video anomaly detection," in ECCV, 2022, pp. 395–412.
  [20] H. Huang, Z. Hu, D. Feng, C. Chen, D. Li, H. Liu, and L. Duan, "Enabling real-world supervised video anomaly detection: New open-set benchmark and new framework," TMM, 2026.
  [21] A. Acsintoae, A. Florescu, M. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, "UBnormal: New benchmark for supervised open-set video anomaly detection," in CVPR, 2022, pp. 20111–20121.
  [22] Z. Wang, X. Gu, H. Yan, and X. Gu, "Domain generalization for video anomaly detection considering diverse anomaly types," Signal, Image and Video Processing, vol. 18, no. 4, pp. 3691–3704, 2024.
  [23] Y. Jain, A. Dabouei, and M. Xu, "Cross-domain learning for video anomaly detection with limited supervision," in ECCV, 2024, pp. 468–484.
  [24] A. Aich, K.-C. Peng, and A. K. Roy-Chowdhury, "Cross-domain video anomaly detection without target domain adaptation," in WACV, 2023, pp. 2579–2591.
  [25] L. Zhou, Y. Gao, M. Zhang, P. Wu, P. Wang, and Y. Zhang, "Human-centric behavior description in videos: New benchmark and model," TMM, vol. 26, pp. 10867–10878, 2024.
  [26] Q. Bao, F. Liu, L. Jiao, Y. Liu, S. Li, L. Li, X. Liu, X. Wang, and B. Chen, "Anomaly-led prompting learning caption generating model and benchmark," TMM, 2025.
  [27] H. Du, S. Zhang, B. Xie, G. Nan, J. Zhang, J. Xu, H. Liu, S. Leng, J. Liu, H. Fan, D. Huang, J. Feng, L. Chen, C. Zhang, X. Li, H. Zhang, J. Chen, Q. Cui, and X. Tao, "Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly," in CVPR, 2024, pp. 18793–18803.
  [28] T. Yuan, X. Zhang, K. Liu, B. Liu, C. Chen, J. Jin, and Z. Jiao, "Towards surveillance video-and-language understanding: New dataset, baselines, and challenges," in CVPR, 2024, pp. 22052–22061.
  [29] X. Wang, X. Wu, and Z. Liu, "Enhancing video anomaly understanding via multi-task instruction tuning," IEEE Signal Processing Letters, vol. 32, pp. 4359–4363, 2025.
  [30] M. Cho, T. Kim, M. Shim, D. Wee, and S. Lee, "Towards multi-domain learning for generalizable video anomaly detection," NeurIPS, vol. 37, pp. 50256–50284, 2024.
  [31] X. Tang, J. Qiu, L. Xie, Y. Tian, J. Jiao, and Q. Ye, "Adaptive keyframe sampling for long video understanding," in CVPR, 2025, pp. 29118–29128.
  [32] T. Fu, T. Liu, Q. Han, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang, "FrameFusion: Combining similarity and importance for video token reduction on large vision language models," in ICCV, 2025, pp. 22654–22663.
  [33] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," TCSVT, vol. 13, no. 7, pp. 560–576, 2003.
  [34] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, "Efficient streaming language models with attention sinks," arXiv preprint, 2023.
  [35] Z. Liu, X. Wu, W. Li, L. Yang, and S. Wang, "Rethinking metrics and benchmarks of video anomaly detection," arXiv preprint arXiv:2505.19022, 2025.
  [36] X. Wang and D. Zhou, "Chain-of-thought reasoning without prompting," NeurIPS, vol. 37, pp. 66383–66409, 2024.
  [37] J. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, "Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection," in CVPR, 2019, pp. 1237–1246.
  [38] C. Cao, Y. Lu, P. Wang, and Y. Zhang, "A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation," in CVPR, 2023, pp. 20392–20401.
  [39] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis et al., "A large-scale benchmark dataset for event recognition in surveillance video," in CVPR, 2011, pp. 3153–3160.
  [40] K. Corona, K. Osterdahl, R. Collins, and A. Hoogs, "MEVA: A large-scale multiview, multimodal video dataset for activity detection," in WACV, 2021, pp. 1060–1068.
  [41] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in CVPR, 2018, pp. 6479–6488.
  [42] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, "Not only look, but also listen: Learning multimodal violence detection under weak supervision," in ECCV, 2020, pp. 322–339.
  [43] L. Zhu, L. Wang, A. Raj, T. Gedeon, and C. Chen, "Advancing video anomaly detection: A concise review and a new dataset," in NeurIPS, vol. 37, 2024, pp. 89943–89977.
  [44] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
  [45] Y. Pu, X. Wu, L. Yang, and S. Wang, "Learning prompt-enhanced context features for weakly-supervised video anomaly detection," TIP, vol. 33, pp. 4923–4936, 2024.
  [46] Z. Sun, Z. Peng, Y. Ma, Y. Chen, Z. Zhou, Z. Zhou, G. Zhang, Y. Zhang, Y. Zhou, Q. Lu et al., "Streamavatar: Streaming diffusion models for real-time interactive human avatars," arXiv preprint arXiv:2512.22065, 2025.
  [47] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, "Token merging: Your ViT but faster," arXiv preprint arXiv:2210.09461, 2022.
  [48] K. Tao, C. Qin, H. You, Y. Sui, and H. Wang, "DyCoke: Dynamic compression of tokens for fast video large language models," in CVPR, 2025, pp. 18992–19001.
  [49] B. Yang, B. Wen, B. Ding, C. Liu, C. Chu, C. Song, C. Rao, C. Yi, D. Li, D. Zang et al., "Kwai Keye-VL 1.5 technical report," arXiv preprint arXiv:2509.01563, 2025.
  [50] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.
  [51] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge et al., "Qwen3-VL technical report," arXiv preprint arXiv:2511.21631, 2025.