Dima Damen¹, Hazel Doughty¹, Giovanni Maria Farinella², Sanja Fidler³, Antonino Furnari², Evangelos Kazakos¹, Davide Moltisanti¹, Jonathan Munro¹, Toby Perrett¹, Will Price¹, and Michael Wray¹
The paper introduces EPIC-Kitchens, a large-scale egocentric video benchmark recorded in participants' native kitchen environments. The dataset comprises 55 hours of video (11.5 million frames), densely labeled with 39,600 action segments and 454,300 object bounding boxes. The recordings capture non-scripted daily activities: participants recorded every visit to their kitchen over three consecutive days. Annotation is distinctive in that participants narrate their own videos after recording, so the labels reflect true intent, and ground truths are then crowd-sourced from these narrations. The dataset targets challenges in object detection, action recognition, and action anticipation, each evaluated on two test splits: seen and unseen kitchens. The paper reports several baselines on these challenges and discusses the dataset's potential for advancing first-person vision research.