The “something something” video database for learning and evaluating visual common sense


15 Jun 2017 | Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, Roland Memisevic
The paper introduces the "something-something" video database, a collection of over 100,000 videos across 174 classes designed to support learning of visual common sense. Unlike traditional image datasets, this database focuses on detailed physical aspects of scenes and actions, aiming to improve neural networks' ability to reason about complex scenes and to ground visual knowledge in natural language. The videos are labeled using caption-templates, which allow fine-grained descriptions of objects and actions. The authors describe the challenges of crowd-sourcing this data at scale and present a scalable framework for video recording. They also discuss the difficulties of learning from video data, emphasizing the need for short, detailed labels and the importance of fine-grained discrimination tasks.
Baseline experiments using various encoding methods show that the task is challenging, even for sophisticated architectures such as 3D CNNs. The paper concludes by highlighting the ongoing nature of the dataset collection effort and its potential to advance common sense reasoning in visual tasks.
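The caption-template labeling scheme can be illustrated with a minimal sketch. The `instantiate` helper below is hypothetical (not from the paper); it shows the general idea of expanding a template's `[something]` placeholders into a concrete caption, as crowd workers do when recording a video. The example template strings are illustrative, not necessarily exact class names from the dataset.

```python
def instantiate(template: str, objects: list[str]) -> str:
    """Fill each "[something]" placeholder with an object name, left to right."""
    caption = template
    for obj in objects:
        # Replace only the first remaining placeholder per object.
        caption = caption.replace("[something]", obj, 1)
    return caption

# Illustrative template classes (the dataset defines 174 such classes).
templates = [
    "Putting [something] into [something]",
    "Pushing [something] from left to right",
]

print(instantiate(templates[0], ["a pen", "a cup"]))
# Putting a pen into a cup
```

Because the placeholders are filled per video, the same action class covers many object combinations, which is what makes fine-grained discrimination between classes (rather than object recognition alone) the core challenge.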