25 Jul 2018 | Bolei Zhou, Alex Andonian, Aude Oliva, Antonio Torralba
This paper introduces the Temporal Relation Network (TRN), a simple and interpretable module that enables neural networks to learn and reason about temporal dependencies between video frames at multiple time scales. Inspired by relational networks, TRN models the temporal relations between sparsely sampled observations in a video, and it is designed as a general, extensible module that can be combined with any existing CNN architecture. TRN is evaluated on three recent video datasets - Something-Something, Jester, and Charades - which were built around activities whose recognition depends on temporal relational reasoning, such as human-object interactions and hand gestures. TRN-equipped networks achieve very competitive results given only discrete RGB frames, outperforming two-stream networks and 3D convolutional networks at recognizing daily activities on Charades. Because it is trained on sparsely sampled frames rather than dense frame stacks, TRN is efficient and well suited to real-time processing of streaming video, and it outperforms dense frame-based baselines while using far fewer frames. Further analyses show that the models are sensitive to the temporal order of frames, which is crucial for activities involving transformations, and that they learn intuitive, interpretable visual common sense knowledge about object interactions in videos.
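Concretely, the pairwise temporal relation is defined over the CNN features f_i of the temporally ordered sampled frames, and the multi-scale variant sums relations over frame subsets of increasing size (notation follows the paper):

```latex
T_2(V) = h_\phi\Big( \sum_{i<j} g_\theta(f_i, f_j) \Big), \qquad
MT_N(V) = T_2(V) + T_3(V) + \cdots + T_N(V)
```

Here g_theta and h_phi are small MLPs trained end-to-end with the base CNN, and each scale d uses its own parameters, so relations at different time scales are scored independently and then summed.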
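To make the mechanism concrete, below is a minimal sketch of a multi-scale TRN head in PyTorch. It is an illustrative reimplementation, not the authors' reference code: the class name MultiScaleTRN, the feature_dim/hidden_dim sizes, and the cap of max_subsets relation tuples per scale are assumptions made here for brevity (the paper likewise subsamples frame tuples per scale, but randomly, to keep the cost low).

```python
# Minimal multi-scale TRN head (illustrative sketch, not the authors' code).
import itertools
import torch
import torch.nn as nn

class MultiScaleTRN(nn.Module):
    """Scores ordered subsets of frame features at several time scales."""

    def __init__(self, feature_dim=256, hidden_dim=256, num_frames=8,
                 num_classes=174, max_subsets=3):
        super().__init__()
        self.num_frames = num_frames
        self.max_subsets = max_subsets  # relation tuples evaluated per scale
        self.scales = list(range(2, num_frames + 1))  # 2-frame ... N-frame relations
        # One small MLP per scale; g_theta and h_phi are collapsed into a single
        # network mapping the concatenated d-frame features to class scores.
        self.relation_mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d * feature_dim, hidden_dim),
                nn.ReLU(inplace=True),
                nn.Linear(hidden_dim, num_classes),
            )
            for d in self.scales
        )

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, feature_dim), in temporal order.
        batch = frame_features.size(0)
        logits = 0.0
        for d, mlp in zip(self.scales, self.relation_mlps):
            # itertools.combinations yields index tuples in temporal order.
            # Here we deterministically keep the first few; the paper samples
            # random tuples at each step instead.
            combos = list(itertools.combinations(range(self.num_frames), d))
            combos = combos[: self.max_subsets]
            scale_logits = 0.0
            for idx in combos:
                feats = frame_features[:, list(idx), :].reshape(batch, -1)
                scale_logits = scale_logits + mlp(feats)
            logits = logits + scale_logits / len(combos)
        return logits
```

In use, a frame-level backbone (the paper uses BN-Inception) would embed each of the 8 sparsely sampled frames into a feature vector, and this head would turn the (batch, 8, feature_dim) tensor into clip-level logits; because per-scale scores are simply summed, the module bolts onto any existing CNN without changing its architecture. The 174-class output is an assumption matching Something-Something.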