Dense-Captioning Events in Videos

2 May 2017 | Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles
This paper introduces the task of dense-captioning events in videos, which requires both detecting and describing the events in a video. The authors propose a model that identifies all events in a single pass over the video while simultaneously describing each detected event in natural language. It introduces a variant of an existing action-proposal module that captures both short and long events, together with a new captioning module that uses contextual information from past and future events to jointly describe all events in the video.

The authors also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events containing 20k videos with 100k total sentence descriptions. Videos are up to 10 minutes long, and each is annotated with an average of 3.65 sentences. Using this dataset, they report the model's performance on dense-captioning events, video retrieval, and event localization.

Experiments show that the model can detect and describe events in long or even streaming videos, and that using context from other events in the video improves dense-captioning performance. The authors further demonstrate how ActivityNet Captions can be used to study video retrieval and event localization, and a comparison with existing video captioning models shows higher captioning accuracy. They conclude that the model is effective for dense-captioning events and that ActivityNet Captions is a valuable resource for further research in this area.
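The summary above describes a two-part architecture: a single-pass proposal module that scores candidate events at multiple temporal scales, and a captioning module that conditions each event's description on context from the surrounding video. The sketch below illustrates that general idea in PyTorch. It is not the authors' implementation; the module names, feature dimensions, vocabulary size, and the crude mean-pooled "past"/"future" context are assumptions made purely for illustration.

```python
# Minimal sketch (not the paper's code): a single-pass proposal module scoring
# several anchor lengths per timestep, plus a captioning decoder that conditions
# on pooled past/future context. All names and dimensions are hypothetical.
import torch
import torch.nn as nn

class MultiStrideProposals(nn.Module):
    """Scores K anchor lengths at every timestep from a sequence encoder."""
    def __init__(self, feat_dim=500, hidden=512, num_anchors=4):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.scorer = nn.Linear(hidden, num_anchors)  # one score per anchor length

    def forward(self, clip_feats):                    # (B, T, feat_dim)
        hidden_states, _ = self.encoder(clip_feats)
        scores = torch.sigmoid(self.scorer(hidden_states))  # (B, T, K)
        return hidden_states, scores

class ContextCaptioner(nn.Module):
    """Decodes a caption for one event, conditioned on past/future context."""
    def __init__(self, hidden=512, vocab=1000, embed=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.decoder = nn.LSTMCell(embed + 3 * hidden, hidden)  # event + past + future
        self.out = nn.Linear(hidden, vocab)

    def forward(self, event_feat, past_ctx, future_ctx, tokens):
        ctx = torch.cat([event_feat, past_ctx, future_ctx], dim=-1)  # (B, 3*hidden)
        h = torch.zeros(tokens.size(0), self.out.in_features, device=tokens.device)
        c = torch.zeros_like(h)
        logits = []
        for t in range(tokens.size(1)):               # teacher-forced decoding
            step_in = torch.cat([self.embed(tokens[:, t]), ctx], dim=-1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)             # (B, L, vocab)

if __name__ == "__main__":
    B, T = 2, 50
    proposals, captioner = MultiStrideProposals(), ContextCaptioner()
    feats = torch.randn(B, T, 500)                    # stand-in clip features
    states, scores = proposals(feats)
    event = states[:, -1]                             # pretend this is one event
    past = states[:, :T // 2].mean(dim=1)             # mean-pooled "past" context
    future = states[:, T // 2:].mean(dim=1)           # mean-pooled "future" context
    tokens = torch.randint(0, 1000, (B, 8))
    print(captioner(event, past, future, tokens).shape)  # torch.Size([2, 8, 1000])
```

The point of the sketch is the wiring, not the specifics: proposals are scored everywhere in a single forward pass rather than with a sliding-window re-run, and the decoder sees context from before and after the event in addition to the event's own features.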