2 Aug 2016 | Limin Wang1, Yuanjun Xiong2, Zhe Wang3, Yu Qiao3, Dahua Lin2, Xiaoou Tang2, and Luc Van Gool1
This paper introduces the Temporal Segment Network (TSN), a novel framework for video-based action recognition built around modeling long-range temporal structure. TSN combines a sparse temporal sampling strategy with video-level supervision, enabling efficient and effective learning over the entire action video rather than short, densely sampled clips. The authors also study a series of good practices for learning ConvNets on video data with limited training samples, including cross-modality pre-training, regularization techniques, and enhanced data augmentation. The proposed method achieves state-of-the-art performance on the HMDB51 and UCF101 datasets, demonstrating the effectiveness of both TSN and the proposed good practices, and visualizations of the learned ConvNet models further illustrate these gains.
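To make the two core ideas concrete, the following is a minimal sketch in PyTorch (an assumption for illustration; it is not the authors' released code) of sparse temporal sampling and an averaging segmental consensus. All names here (`sample_snippet_indices`, `TemporalSegmentNetwork`, `base_model`, `num_segments`) are hypothetical, chosen only to mirror the description in the abstract.

```python
import random

import torch
import torch.nn as nn


def sample_snippet_indices(num_frames: int, num_segments: int = 3) -> list:
    """Sparse temporal sampling: divide the video into K equal-duration
    segments and draw one random snippet index from each segment.
    (Illustrative helper; assumes num_frames >= num_segments.)"""
    seg_len = num_frames // num_segments
    return [i * seg_len + random.randrange(seg_len) for i in range(num_segments)]


class TemporalSegmentNetwork(nn.Module):
    """Snippet-level ConvNet scores fused by an averaging segmental consensus."""

    def __init__(self, base_model: nn.Module, num_segments: int = 3):
        super().__init__()
        self.base_model = base_model      # snippet-level ConvNet, F(.; W)
        self.num_segments = num_segments  # K segments sampled per video

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (batch, K, C, H, W) -- one snippet drawn from each
        # of the K segments returned by sample_snippet_indices.
        b, k, c, h, w = snippets.shape
        scores = self.base_model(snippets.reshape(b * k, c, h, w))
        scores = scores.reshape(b, k, -1)
        # Segmental consensus: average the K snippet-level class scores
        # into one video-level score, so the whole network can be trained
        # with a single video-level label (video-level supervision).
        return torch.softmax(scores.mean(dim=1), dim=1)
```

Because only K snippets are processed per video regardless of its length, the cost of a training iteration stays fixed while the loss still reflects the whole video, which is what makes the sparse-sampling scheme efficient.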