4 Aug 2017 | Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
This paper introduces the task of localizing video moments using natural language in an open-world setting. The authors propose the Moment Context Network (MCN), which integrates local and global video features to effectively localize natural language queries in videos. To train and evaluate their model, they collect the Distinct Describable Moments (DiDeMo) dataset, which contains over 40,000 pairs of localized video moments and corresponding natural language descriptions. The dataset includes over 10,000 unedited personal videos with diverse content such as pets, concerts, and sports games.
The MCN model is designed to capture both local and global temporal context by integrating local features of the candidate moment, global features of the entire video, and temporal endpoint features that encode when the moment occurs within the video. It uses a joint video-language embedding to align referring expressions with corresponding video moments. The model is trained with a ranking loss that encourages a referring expression to lie closer to its corresponding moment than to negative moments in the shared embedding space.
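As a rough illustration of this training objective, the sketch below implements a shared video-language embedding and a margin-based ranking loss in PyTorch. The feature dimensions, margin value, and single-negative formulation are illustrative assumptions, not the paper's exact architecture or negative-sampling strategy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Projects video-moment features and sentence features into a shared space.

    All layer sizes here are illustrative assumptions, not the paper's settings.
    """
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=100):
        super().__init__()
        self.video_fc = nn.Linear(video_dim, embed_dim)  # moment features -> embedding
        self.text_fc = nn.Linear(text_dim, embed_dim)    # sentence encoding -> embedding

    def forward(self, video_feat, text_feat):
        v = F.normalize(self.video_fc(video_feat), dim=-1)
        t = F.normalize(self.text_fc(text_feat), dim=-1)
        return v, t

def ranking_loss(text_emb, pos_moment_emb, neg_moment_emb, margin=0.1):
    """Push the query closer to its ground-truth moment than to a negative moment."""
    d_pos = torch.norm(text_emb - pos_moment_emb, dim=-1)
    d_neg = torch.norm(text_emb - neg_moment_emb, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```

At test time, the same distance would be computed between the query embedding and every candidate moment in the video, and the moments ranked by that distance.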
The authors compare their model to several baselines, including a moment frequency prior and a canonical correlation analysis (CCA) model, and find that MCN outperforms them on rank-based recall (Rank@1 and Rank@5) and mean intersection over union (IoU). Qualitatively, MCN correctly retrieves a variety of moments, including those that require understanding temporal indicators and camera motion.
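For reference, temporal IoU between a predicted moment and a ground-truth moment can be computed as below. This is the standard definition for one-dimensional segments, written as a hypothetical helper rather than the authors' evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) pairs."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0

# A prediction covering 5-15 s against a ground-truth moment of 10-20 s overlaps by a third.
print(temporal_iou((5.0, 15.0), (10.0, 20.0)))  # ~0.333
```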
The DiDeMo dataset is unique in that it includes a validation step to ensure that descriptions refer to a specific moment in a video. The dataset is collected from personal videos and includes a wide range of visual concepts, making it suitable for studying temporal language grounding in an open-world setting. The authors also discuss the importance of both appearance and optical flow features for effective moment localization.
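One simple way to exploit both cues is late fusion of the query-moment distances computed from appearance (RGB) and optical-flow embeddings. The sketch below uses an equal weighting, which is an assumption for illustration rather than the paper's fusion scheme.

```python
import torch

def fused_distance(rgb_dist, flow_dist, weight=0.5):
    """Late fusion of query-moment distances from RGB and optical-flow embeddings.

    `weight` is a hypothetical hyperparameter; candidate moments would be
    ranked by this fused distance at retrieval time.
    """
    return weight * rgb_dist + (1.0 - weight) * flow_dist

# Rank three hypothetical candidate moments for one query: lower fused distance is better.
rgb = torch.tensor([0.9, 0.4, 0.7])
flow = torch.tensor([0.8, 0.5, 0.2])
best_moment = torch.argmin(fused_distance(rgb, flow)).item()
```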
The paper concludes that while MCN performs well on the DiDeMo dataset, there are still challenges in modeling complex sentence structures and handling rare activities. The authors suggest that future work should focus on improving generalization to previously unseen vocabulary and advancing temporal language reasoning.