CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.
Focus: Efficient keyframe selection for long video understanding
7 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 7
years
2026 7
representative citing papers
GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.
LFS learns to select temporally diverse and event-aware frames for video captioning by using direct feedback from frozen video-LLMs, yielding gains up to 2% on VDC and over 4% on the new ICH-CC benchmark.
DynFrame introduces tokenized learnable span-density retrieval and Segment-Decoupled GRPO in video MLLMs, achieving competitive or SOTA results on six benchmarks with 4B and 8B models.
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving performance competitive with SOTA methods and contributing a new IR-600K dataset.
Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.
citing papers explorer
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG