Temporal Relational Reasoning in Videos

25 Jul 2018 | Bolei Zhou, Alex Andonian, Aude Oliva, Antonio Torralba
The paper introduces the Temporal Relation Network (TRN), a network module designed to enable temporal relational reasoning in videos. TRN is aimed at learning and reasoning about temporal dependencies between video frames at multiple time scales. The authors evaluate TRN-equipped networks on three recent video datasets (Something-Something, Jester, and Charades) used for activity recognition tasks that fundamentally depend on temporal relational reasoning. The results demonstrate that TRN-equipped networks can accurately predict human-object interactions and identify various human gestures with competitive performance, even when given only sparsely sampled video frames. The paper also shows that TRN-equipped networks outperform two-stream networks and 3D convolution networks in recognizing daily activities. Further analyses reveal that the models learn intuitive and interpretable visual common sense knowledge in videos.

The introduction highlights the importance of temporal relational reasoning in intelligent decision-making and activity recognition, while the related work section discusses existing approaches and challenges in activity recognition using convolutional neural networks. The Temporal Relation Networks section details the framework and training methods, followed by experimental results and discussions on the effectiveness of TRN across the datasets. The paper concludes by emphasizing the significance of TRN in enabling temporal relational reasoning and its potential for early activity recognition.
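To make the core idea concrete, the paper's 2-frame temporal relation can be written as T2(V) = h_phi( sum over frame pairs i < j of g_theta(f_i, f_j) ), where f_i are per-frame CNN features and g_theta, h_phi are small MLPs. Below is a minimal NumPy sketch of that function; the dimensions, the single-layer form of g_theta and h_phi, and the random weights are illustrative assumptions, not the paper's actual architecture or learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
FEAT_DIM = 16    # per-frame feature size (the paper uses CNN features)
HID_DIM = 32     # hidden size of the relation function g_theta
NUM_CLASSES = 4  # activity classes

# Randomly initialized weights standing in for learned parameters.
W_g = rng.standard_normal((2 * FEAT_DIM, HID_DIM)) * 0.1
W_h = rng.standard_normal((HID_DIM, NUM_CLASSES)) * 0.1

def relu(x):
    return np.maximum(x, 0.0)

def two_frame_relation(frames):
    """T2(V) = h_phi( sum_{i<j} g_theta([f_i, f_j]) ).

    `frames` is an (n, FEAT_DIM) array of per-frame features in
    temporal order; the sum runs over ordered frame pairs, so the
    module is sensitive to temporal order by construction.
    """
    n = frames.shape[0]
    acc = np.zeros(HID_DIM)
    for i in range(n):
        for j in range(i + 1, n):
            pair = np.concatenate([frames[i], frames[j]])
            acc += relu(pair @ W_g)   # g_theta: a one-layer MLP here
    return acc @ W_h                  # h_phi: linear class scores

# Sparsely sampled "video": 8 frames of random features.
scores = two_frame_relation(rng.standard_normal((8, FEAT_DIM)))
print(scores.shape)  # (4,)
```

The multi-scale TRN described in the paper extends this by summing analogous relation terms over frame triples, quadruples, and so on, each with its own g and h.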