
arxiv: 2607.00622 · v1 · pith:RBOCU4QUnew · submitted 2026-07-01 · 💻 cs.CV

Learning to Watch: Active Video Anomaly Understanding via Interleaved Policy Optimization

Pith reviewed 2026-07-02 14:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords video anomaly understanding · active decision making · policy optimization · weak supervision · temporal operators · evidence inquiry · interleaved policy · closed-loop video analysis

The pith

Anom-π turns video anomaly understanding into active sequential decisions so a 2B-parameter model can outperform larger passive systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that passive sampling creates observational aliasing because static frames cannot separate semantically distinct events in video. It reconceptualizes the task as a closed-loop process in which the model interleaves reasoning with evidence gathering through operators such as backtracking, temporal expansion, and fine-grained sampling. These behaviors are trained end-to-end under video-level labels by Interactive Direct Preference Optimization driven by an Active Evidence Inquiry utility that trades off success, informativeness, and cost. The central result is that the resulting policy learns to disambiguate hypotheses while avoiding redundant actions, yielding competitive accuracy even with only 2B parameters. A reader would care because the shift from passive observation to strategic interaction directly addresses why current large models still fail on ambiguous cases.

Core claim

Anom-π unifies internal cognitive reasoning and strategic evidence acquisition into an interleaved policy that employs temporal atomic operators such as local backtracking, temporal expansion, and fine-grained sampling. These interaction strategies are learned under video-level weak supervision by Interactive Direct Preference Optimization (iDPO) guided by an Active Evidence Inquiry (AEI) utility that balances task success, informative evidence acquisition, and interaction cost, enabling the agent to actively disambiguate hypotheses while suppressing redundant exploration.
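
The closed-loop process described above can be sketched as a small decision loop. This is a minimal illustration, not the paper's implementation: the action names follow the Figure 2 caption, while `ViewState`, `apply_operator`, `run_episode`, `scripted_policy`, the 16-frame step, and the stride-halving rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Action names follow the Figure 2 caption; their exact semantics here are assumed.
ACTIONS = ("THINK", "BACKTRACK", "EXPAND", "SAMPLE", "FINAL")


@dataclass
class ViewState:
    """Temporal window [start, end) in frames, sampling stride, and action log."""
    start: int
    end: int
    stride: int
    history: List[str] = field(default_factory=list)


def apply_operator(state: ViewState, action: str, n_frames: int, step: int = 16) -> ViewState:
    """Apply one hypothetical temporal operator and return the resulting view."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    start, end, stride = state.start, state.end, state.stride
    if action == "BACKTRACK":
        # Local backtracking: slide the window earlier, keeping its width.
        width = end - start
        start = max(0, start - step)
        end = start + width
    elif action == "EXPAND":
        # Temporal expansion: widen the window on both sides, clipped to the video.
        start, end = max(0, start - step), min(n_frames, end + step)
    elif action == "SAMPLE":
        # Fine-grained sampling: halve the stride inside the same window.
        stride = max(1, stride // 2)
    # THINK and FINAL leave the view unchanged; they only extend the log.
    return ViewState(start, end, stride, state.history + [action])


def run_episode(policy: Callable[[ViewState], str], init: ViewState,
                n_frames: int, max_steps: int = 8) -> ViewState:
    """Alternate policy decisions and operator application until FINAL or the step budget."""
    state = init
    for _ in range(max_steps):
        action = policy(state)
        state = apply_operator(state, action, n_frames)
        if action == "FINAL":
            break
    return state


def scripted_policy(state: ViewState) -> str:
    """Stand-in for the learned policy: replays a fixed script, then keeps emitting FINAL."""
    script = ["THINK", "BACKTRACK", "SAMPLE", "FINAL"]
    return script[min(len(state.history), len(script) - 1)]


final = run_episode(scripted_policy, ViewState(start=64, end=96, stride=8), n_frames=300)
print(final.start, final.end, final.stride, final.history)
```

In a real system the scripted policy would be replaced by the vision-language model, which would choose each action from the frames currently in view; the loop structure is the only part this sketch is meant to convey.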

What carries the argument

The interleaved policy whose actions are temporal atomic operators and whose learning signal is iDPO alignment under the AEI utility.

If this is right

  • The agent acquires evidence selectively rather than exhaustively, reducing interaction cost while raising task success.
  • Trajectory-level preference optimization succeeds even when only video-level labels are available.
  • Smaller models become competitive in complex VAU scenarios once perceptual proactivity is enabled.
  • Redundant exploration is suppressed because the AEI utility penalizes uninformative actions.
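
One way to picture how a utility that trades off success, informativeness, and cost could rank rollouts is the sketch below. The abstract does not give the AEI formula, so the weighted-sum form, the weights `w_info` and `w_cost`, the `info_gain` field, and the `margin` threshold are all assumptions for illustration only.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

# Actions that query the video; THINK and FINAL are treated as free here (an assumption).
EVIDENCE_ACTIONS = {"BACKTRACK", "EXPAND", "SAMPLE"}


@dataclass
class Rollout:
    actions: List[str]
    success: float    # 1.0 if the video-level prediction matches the label, else 0.0
    info_gain: float  # hypothetical informativeness score in [0, 1]


def aei_utility(r: Rollout, w_info: float = 0.5, w_cost: float = 0.1) -> float:
    """Assumed weighted-sum utility: success + informativeness - interaction cost."""
    cost = sum(1 for a in r.actions if a in EVIDENCE_ACTIONS)
    return r.success + w_info * r.info_gain - w_cost * cost


def preference_pairs(rollouts: List[Rollout], margin: float = 0.05) -> List[Tuple[Rollout, Rollout]]:
    """Rank rollouts by utility and emit (preferred, rejected) pairs separated by a margin."""
    ranked = sorted(rollouts, key=aei_utility, reverse=True)
    return [(hi, lo) for hi, lo in combinations(ranked, 2)
            if aei_utility(hi) - aei_utility(lo) > margin]


concise = Rollout(["THINK", "BACKTRACK", "FINAL"], success=1.0, info_gain=0.6)
redundant = Rollout(["THINK", "SAMPLE", "SAMPLE", "SAMPLE", "SAMPLE", "FINAL"], success=1.0, info_gain=0.6)
premature = Rollout(["FINAL"], success=0.0, info_gain=0.0)

for preferred, rejected in preference_pairs([premature, redundant, concise]):
    print(preferred.actions, ">", rejected.actions)
```

Under these assumed weights, the concise rollout outranks the redundant one even though both succeed, because the four repeated SAMPLE actions add cost without adding information. The premature rollout ranks last because it pays no cost but also earns no success.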

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same active-policy structure could be applied to other sparse-cue video tasks such as action anticipation or long-form summarization.
  • Distilling the learned operators into a passive model might improve conventional VAU baselines without runtime interaction.
  • If the interaction-cost term were weighted more heavily, the framework could support real-time deployment on edge devices where full video review is expensive.

Load-bearing premise

Video-level weak supervision plus the AEI utility is sufficient to produce non-trivial, non-collapsing interaction policies.

What would settle it

A controlled test in which the iDPO-trained policy produces the same accuracy as a fixed passive sampler on a set of deliberately ambiguous anomaly videos would show the active strategy adds no value.
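
Such a comparison could be run as a paired test on per-video correctness. The sketch below uses a paired bootstrap over videos; the function name, the 95% interval, and the toy correctness vectors are assumptions for illustration, not data or protocol from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_diff(correct_active: List[int], correct_passive: List[int],
                          n_boot: int = 2000, seed: int = 0) -> Tuple[float, Tuple[float, float]]:
    """Return the accuracy difference (active - passive) and a 95% bootstrap interval.

    Both inputs are 0/1 correctness flags for the same videos, in the same order.
    """
    if len(correct_active) != len(correct_passive) or not correct_active:
        raise ValueError("inputs must be non-empty and aligned per video")
    rng = random.Random(seed)
    n = len(correct_active)
    per_video = [a - p for a, p in zip(correct_active, correct_passive)]
    observed = sum(per_video) / n
    samples = []
    for _ in range(n_boot):
        # Resample videos with replacement, keeping each active/passive pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(sum(per_video[i] for i in idx) / n)
    samples.sort()
    low = samples[int(0.025 * n_boot)]
    high = samples[int(0.975 * n_boot) - 1]
    return observed, (low, high)


# Toy correctness flags on ten hypothetical ambiguous videos.
active = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
passive = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
diff, (low, high) = paired_bootstrap_diff(active, passive)
print(f"accuracy difference = {diff:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

An interval that contains zero on the deliberately ambiguous subset would be consistent with the scenario described above, in which the active strategy adds no measurable value.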

Figures

Figures reproduced from arXiv: 2607.00622 by Jiaxu Leng, Mengjingcheng Mo, Xinbo Gao.

Figure 1: Comparison of passive and active evidence acquisition for video anomaly understanding. (a) Existing VAU methods use static sampling in the verbal space, which may miss sparse anomaly evidence. (b) Anom-π actively explores the video in the visual space to gather informative observations and detect anomalies.
Figure 2: Overview of Anom-π. Left: closed-loop active inquiry that alternates deliberation (THINK) with evidence acquisition (BACKTRACK/EXPAND/SAMPLE) before terminating with FINAL to output clip-level hypotheses. Right: trajectory-level preference construction and optimization via iDPO from ranked rollouts.
Figure 3: Instantaneous anomaly case. Anom-π refines temporal boundaries by probing around a suspected moment and then terminates early.
Figure 4: Temporal anomaly score curves. We compare Anom-π with VadCLIP, LAVAD, and VERA on three scenarios. Yellow indicates the ground-truth anomalous interval and blue indicates predicted anomalous spans after thresholding the anomaly score at 0.5.
Figure 5: Behavioral diagnostics. Evidence Refinement Rate (ERR), Hypothesis Revision Rate (HRR), and premature finalization rate under different training objectives. (Bar chart; y-axis: Percentage (%); legend: ZERO-SHOT, SFT (pos-only), DPO (w/o curiosity), iDP…)
Figure 6: Qualitative comparison cases I. Anom-π is shown in (a,c), and the open-loop QwenVL baseline is shown in (b,d).
Figure 7: Qualitative comparison cases II. Anom-π is shown in (e,g), and the open-loop QwenVL baseline is shown in (f,h).
Figure 8: Failure-case visualizations. (a) Redundant exploration under low visual quality. (b) Premature stopping under persistent temporal ambiguity.
Original abstract

Video anomaly understanding (VAU) relies on sparse, context-dependent cues. However, existing passive paradigms suffer from observational aliasing, where static sampling fails to disambiguate semantically distinct events. To overcome this, we propose Anom-π, a closed-loop framework that reconceptualizes video understanding as an active sequential decision-making process within a dynamic environment. Inspired by human video-reviewing behavior, this framework unifies internal cognitive reasoning and strategic evidence acquisition into an interleaved policy, utilizing temporal atomic operators such as local backtracking, temporal expansion, and fine-grained sampling to endow the model with perceptual proactivity. To learn such complex interaction strategies under video-level weak supervision, we design Interactive Direct Preference Optimization (iDPO) to achieve trajectory-level policy alignment, guided by an Active Evidence Inquiry (AEI) utility that balances task success, informative evidence acquisition, and interaction cost. This approach enables the agent to learn to actively disambiguate hypotheses while suppressing redundant exploration. Extensive experiments demonstrate that our framework, with only 2B parameters, achieves highly competitive performance, significantly outperforming state-of-the-art large-scale VAU models in complex scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Anom-π, a closed-loop active framework for video anomaly understanding that models the task as sequential decision-making. It introduces temporal atomic operators (local backtracking, temporal expansion, fine-grained sampling) within an interleaved policy, trained via Interactive Direct Preference Optimization (iDPO) under an Active Evidence Inquiry (AEI) utility that balances task success, evidence informativeness, and interaction cost, all using only video-level weak supervision. The central claim is that this 2B-parameter model achieves highly competitive performance and significantly outperforms state-of-the-art large-scale passive VAU models in complex scenarios.

Significance. If the empirical claims hold and the active policy demonstrably avoids collapse, the work would be significant for shifting VAU from passive sampling to human-inspired active evidence acquisition, potentially enabling smaller models to resolve observational aliasing in context-dependent anomalies. The trajectory-level alignment via iDPO and the multi-objective AEI utility represent a substantive technical contribution over standard supervised or reinforcement baselines.

major comments (1)
  1. [iDPO and AEI utility description (abstract and method)] The headline performance claim (2B-param model significantly outperforming large VAU models via active strategies) is load-bearing on the assumption that iDPO under AEI produces non-redundant use of the temporal atomic operators. With only video-level weak supervision and no mention of an explicit diversity term, entropy bonus, or trajectory-level contrastive loss, it is unclear what mechanism prevents policy collapse to a single operator or constant querying behavior; this must be addressed with either a formal argument or ablation evidence showing distinct operator usage distributions.
minor comments (1)
  1. [Abstract] The abstract asserts 'extensive experiments' and 'highly competitive performance' without naming baselines, metrics, datasets, or result tables; while the full manuscript presumably supplies these, the abstract should at minimum indicate the evaluation protocol.
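
The operator-usage evidence requested in the major comment could be computed along the following lines. This is a sketch of one possible diagnostic, assuming trajectories are logged as lists of action names; the normalized-entropy score and the function names are illustrative choices, not something the paper or the report specifies.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

# Operator names follow the Figure 2 caption.
OPERATORS = ("BACKTRACK", "EXPAND", "SAMPLE")


def operator_usage(trajectories: List[List[str]],
                   operators: Sequence[str] = OPERATORS) -> Dict[str, float]:
    """Fraction of evidence-acquisition actions spent on each operator."""
    counts = Counter(a for traj in trajectories for a in traj if a in operators)
    total = sum(counts.values())
    if total == 0:
        return {op: 0.0 for op in operators}
    return {op: counts[op] / total for op in operators}


def normalized_entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy divided by its maximum: 0 means one operator only, 1 means uniform use."""
    probs = [p for p in dist.values() if p > 0]
    if len(dist) < 2 or not probs:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(dist))


collapsed = [["THINK", "SAMPLE", "SAMPLE", "FINAL"], ["SAMPLE", "FINAL"]]
varied = [["THINK", "BACKTRACK", "FINAL"], ["EXPAND", "SAMPLE", "FINAL"]]

for name, trajs in (("collapsed", collapsed), ("varied", varied)):
    usage = operator_usage(trajs)
    print(name, usage, round(normalized_entropy(usage), 3))
```

A score near zero across scenarios would indicate the single-operator collapse the referee is worried about. Reporting the per-scenario usage dictionaries alongside the score would show whether operator choice actually depends on context.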

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern regarding mechanisms preventing policy collapse in iDPO under AEI below and will revise the manuscript to strengthen this section.

point-by-point responses
  1. Referee: [iDPO and AEI utility description (abstract and method)] The headline performance claim (2B-param model significantly outperforming large VAU models via active strategies) is load-bearing on the assumption that iDPO under AEI produces non-redundant use of the temporal atomic operators. With only video-level weak supervision and no mention of an explicit diversity term, entropy bonus, or trajectory-level contrastive loss, it is unclear what mechanism prevents policy collapse to a single operator or constant querying behavior; this must be addressed with either a formal argument or ablation evidence showing distinct operator usage distributions.

    Authors: The AEI utility is explicitly formulated as a multi-objective function balancing task success, evidence informativeness, and interaction cost. The interaction-cost term directly penalizes redundant or constant querying, while the informativeness objective requires context-dependent selection among the temporal atomic operators (local backtracking, temporal expansion, fine-grained sampling) to resolve aliasing. iDPO performs trajectory-level preference alignment on trajectories that achieve high AEI utility, thereby favoring policies that employ varied operators over collapsed ones. Although an explicit entropy bonus or diversity regularizer is not present, the composite AEI objective supplies the necessary regularization. In the revised manuscript we will add a formal derivation of how AEI discourages collapse together with ablation results on operator-usage distributions across scenarios. revision: yes
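
For readers unfamiliar with trajectory-level preference alignment, the sketch below applies the standard DPO objective to whole trajectories by summing per-step log-probabilities. Whether iDPO uses exactly this form is not stated in the abstract, so the code should be read as the generic DPO loss under that assumption; `beta` and the toy log-probabilities are illustrative values.

```python
import math
from typing import Sequence


def trajectory_logprob(step_logprobs: Sequence[float]) -> float:
    """Log-probability of a whole trajectory as the sum of its per-step log-probabilities."""
    return sum(step_logprobs)


def dpo_loss(logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    loss = -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy per-step log-probabilities under the policy and a frozen reference model.
policy_w = trajectory_logprob([-0.2, -0.3, -0.1])   # preferred (higher-utility) trajectory
policy_l = trajectory_logprob([-0.9, -1.1, -0.8])   # rejected trajectory
ref_w = trajectory_logprob([-0.5, -0.6, -0.4])
ref_l = trajectory_logprob([-0.5, -0.7, -0.6])

print(round(dpo_loss(policy_w, policy_l, ref_w, ref_l), 4))
```

When the policy matches the reference on both trajectories the margin is zero and the loss equals log 2. The loss falls below that value only as the policy raises the preferred trajectory's likelihood relative to the rejected one, which is the sense in which higher-utility trajectories are favored.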

Circularity Check

0 steps flagged

No circularity: framework claims rest on empirical performance, not self-referential definitions or fitted predictions.

full rationale

The provided abstract and description contain no equations, derivations, or parameter-fitting steps that reduce claimed outputs (e.g., performance gains from iDPO under AEI) to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core components. The method is presented as a novel combination of policy optimization and utility design whose validity is asserted via experiments rather than algebraic identity. This matches the default expectation of non-circularity for papers without explicit mathematical reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 4 invented entities

Based solely on the abstract, the claim rests on several new introduced components and domain assumptions without independent evidence or verification.

axioms (2)
  • [domain assumption] Existing passive paradigms suffer from observational aliasing, where static sampling fails to disambiguate events.
    Stated as the core problem motivating the active approach.
  • [ad hoc to paper] Video-level weak supervision suffices to learn complex interaction strategies via iDPO.
    Required for training the policy without stronger labels.
invented entities (4)
  • Anom-π framework (no independent evidence)
    purpose: Closed-loop active VAU system
    New proposed architecture unifying reasoning and evidence acquisition.
  • iDPO (no independent evidence)
    purpose: Trajectory-level policy alignment under weak supervision
    New optimization method described.
  • AEI utility (no independent evidence)
    purpose: Balances task success, informative evidence, and interaction cost
    New guiding utility for the policy.
  • temporal atomic operators (no independent evidence)
    purpose: Enable perceptual proactivity via backtracking, expansion, and sampling
    New set of operators introduced.

pith-pipeline@v0.9.1-grok · 5737 in / 1481 out tokens · 32699 ms · 2026-07-02T14:36:56.541746+00:00 · methodology

