15 Jun 2017 | Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, Roland Memisevic
The "something something" video database is designed to help neural networks learn and evaluate visual common sense by providing a large collection of video clips with detailed textual descriptions. The database contains over 100,000 videos across 174 classes, each defined by caption-templates that include placeholders for objects and actions. The videos are generated by crowdworkers who record and label them based on the templates. The goal of the database is to encourage networks to develop features that can represent physical properties of objects and the world, such as spatial relations or material properties.
The database was created through a crowdsourcing approach: workers are asked to record videos and to complete the caption-templates by supplying appropriate text for the placeholders. The dataset is split into train, validation, and test sets in an 8:1:1 ratio. It covers a wide variety of actions and objects, with 23,137 distinct object names. The videos are short, ranging from 2 to 6 seconds in duration, and are labeled with simple textual descriptions.
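For intuition, the sketch below shows what an 8:1:1 partition looks like in code. Note that the released dataset ships with fixed official splits, so a random partition like this one is purely illustrative.

```python
import random

def split_ids(video_ids: list[str], seed: int = 0):
    """Shuffle and split video ids into train/validation/test in an 8:1:1 ratio."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_ids([f"video_{i}" for i in range(100_000)])
print(len(train), len(val), len(test))  # roughly 80000 10000 10000
```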
The dataset is used to train and evaluate models that predict the textual labels from the videos. The challenge lies in the complexity of the tasks: the labels require an understanding of physical concepts that static, image-based datasets do not easily capture. In this way the database aims to support a more comprehensive, grounded understanding of the physical world than object-recognition benchmarks provide.
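Since the labels form a 174-way classification problem, evaluation typically reduces to top-k accuracy. The helper below is a generic sketch (assuming PyTorch), not code from the paper.

```python
import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 5) -> float:
    """Fraction of examples whose true class is among the k highest-scoring ones."""
    topk = logits.topk(k, dim=1).indices            # (batch, k) predicted class ids
    hits = (topk == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

logits = torch.randn(8, 174)
targets = torch.randint(0, 174, (8,))
print(topk_accuracy(logits, targets))
```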
The dataset is also used to explore the effectiveness of different neural network architectures, such as 2D and 3D convolutional networks, at learning to predict the labels. Baseline results show that the task is particularly challenging due to subtle distinctions between classes and ambiguities in the labels. The dataset is an ongoing effort, with the goal of continuously expanding and improving the collection of videos and labels to better support the development of models that can understand and reason about the physical world.
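To make the baseline setting concrete, here is a minimal sketch (assuming PyTorch) of a small 3D convolutional classifier of the kind such experiments use. The layer sizes, clip length, and input resolution are illustrative choices, not the authors' architecture; only the 174-way output mirrors the dataset's class count.

```python
import torch
import torch.nn as nn

class Simple3DConvNet(nn.Module):
    def __init__(self, num_classes: int = 174):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),  # (B, 3, T, H, W) -> (B, 32, T, H, W)
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),         # pool spatially only
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),                 # pool over time and space
        )
        self.pool = nn.AdaptiveAvgPool3d(1)              # global average pooling
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        x = self.features(clip)
        x = self.pool(x).flatten(1)
        return self.classifier(x)

model = Simple3DConvNet()
dummy_clip = torch.randn(2, 3, 16, 84, 84)  # two 16-frame clips at 84x84
logits = model(dummy_clip)                  # shape: (2, 174)
```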