Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
FOCAL cuts token use by 60% and VLM calls by 72% on desktop streams while raising key recall from 0.38 to 0.61 and staying robust to task switches that break baselines.
citing papers explorer
- DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.
- FOCAL: Filtered On-device Continuous Activity Logging for Efficient Personal Desktop Summarization
FOCAL cuts token use by 60% and VLM calls by 72% on desktop streams while raising key recall from 0.38 to 0.61 and staying robust to task switches that break baselines.