Dense video captioning: A survey of techniques, datasets and evaluation protocols
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.MM (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highlight detection benchmarks.