Surveillancevqa-589k: A benchmark for comprehensive surveillance video-language understanding with large models

Liu, B · 2025 · arXiv 2505.12589

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

representative citing papers

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.

ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos

cs.CV · 2025-12-03 · conditional · novelty 7.0

ProcObject-10K is the first benchmark for object-centric procedural reasoning in videos that exposes a large gap where models answer questions plausibly but fail to ground their answers in the correct video segments.

MetaphorVU: Towards Metaphorical Video Understanding

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

Introduces the first benchmark for metaphorical video understanding, identifies MLLM weaknesses in cross-domain mapping, and proposes an inference-time enhancement using a knowledge graph.

citing papers explorer

Showing 4 of 4 citing papers.

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks cs.CV · 2026-05-21 · unverdicted · none · ref 16
MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning cs.CV · 2026-04-09 · unverdicted · none · ref 14
CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos cs.CV · 2025-12-03 · conditional · none · ref 21
ProcObject-10K is the first benchmark for object-centric procedural reasoning in videos that exposes a large gap where models answer questions plausibly but fail to ground their answers in the correct video segments.
MetaphorVU: Towards Metaphorical Video Understanding cs.CV · 2026-05-25 · unverdicted · none · ref 15
Introduces the first benchmark for metaphorical video understanding, identifies MLLM weaknesses in cross-domain mapping, and proposes an inference-time enhancement using a knowledge graph.

Surveillancevqa-589k: A benchmark for comprehensive surveillance video-language understanding with large models

fields

years

verdicts

representative citing papers

citing papers explorer