pith. machine review for the scientific record.

arxiv: 1708.01641 · v1 · submitted 2017-08-04 · 💻 cs.CV

Localizing Moments in Video with Natural Language

keywords video, language, natural, moment, moments, didemo, expressions, localized

We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms several baseline methods and believe that our initial results together with the release of DiDeMo will inspire further research on localizing video moments with natural language.
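
The abstract describes localizing a query by integrating local and global video features over time. As a rough illustration of that idea only, and not the authors' MCN implementation, the sketch below ranks every contiguous span of clips by its distance to a query vector. Everything in it is an assumption made for illustration: the names `mean_pool`, `squared_distance`, and `rank_moments`, the use of mean pooling, the normalized start/end endpoints appended as a temporal cue, and the premise that the query has already been embedded into the same space as the moment feature (the learned embedding networks are omitted).

```python
def mean_pool(features):
    """Average a list of equal-length feature vectors element-wise."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_moments(clip_features, query_embedding):
    """Rank all contiguous clip spans, closest to the query first.

    Hypothetical setup: each moment is represented by concatenating
    - a local feature (mean over the clips inside the span),
    - a global feature (mean over the whole video), and
    - the normalized start and end positions of the span.
    The query is assumed to already live in this same space.
    """
    num_clips = len(clip_features)
    global_feat = mean_pool(clip_features)  # whole-video context

    scored = []
    for start in range(num_clips):
        for end in range(start, num_clips):
            local_feat = mean_pool(clip_features[start:end + 1])
            endpoints = [start / num_clips, (end + 1) / num_clips]
            moment_feat = local_feat + global_feat + endpoints
            distance = squared_distance(moment_feat, query_embedding)
            scored.append((distance, (start, end)))

    scored.sort()  # smallest distance first; ties broken by span
    return [span for _, span in scored]
```

Calling `rank_moments(clip_features, query_embedding)` returns spans as inclusive `(start, end)` clip indices. In a trained system the ranking would depend on learned video and language embeddings rather than the raw features used here.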

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.