18 Oct 2022 | Zhan Tong, Yibing Song, Jue Wang, Limin Wang
VideoMAE is a data-efficient self-supervised video pre-training (SSVP) method that achieves strong performance with minimal data. Inspired by ImageMAE, it masks an extremely high ratio of video tokens (90-95%) with a customized tube masking strategy, turning video reconstruction into a challenging self-supervised task that forces the model to learn effective spatiotemporal representations.

Key findings:
1. Extremely high masking ratios still yield good performance because of the temporal redundancy in video.
2. VideoMAE performs well on small datasets without any extra data, thanks to the challenging reconstruction task.
3. Data quality matters more than quantity for SSVP, with domain shift between the pre-training and target datasets being a critical factor.

Without extra data, VideoMAE reaches 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. The method uses a vanilla ViT backbone with an asymmetric encoder-decoder architecture: the encoder operates only on the small set of visible tokens, so the high masking ratio both reduces computational cost and improves performance. By exploiting temporal redundancy and correlation, VideoMAE learns effective representations from small datasets, transfers well to downstream tasks such as action detection, and outperforms both previous SSVP methods and contrastive learning approaches in data efficiency and accuracy, while also being efficient in training time and compute.
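To make the tube masking idea concrete, here is a minimal NumPy sketch (the function name, argument names, and shapes are illustrative, not taken from the official codebase): the same randomly sampled spatial mask is repeated across every temporal cube, so a masked patch stays masked in all frames and cannot be trivially recovered from neighboring frames.

```python
import numpy as np

def tube_mask(num_temporal_cubes: int, num_spatial_patches: int,
              mask_ratio: float = 0.9) -> np.ndarray:
    """Return a boolean mask of shape (num_temporal_cubes, num_spatial_patches).

    Tube masking samples ONE spatial mask and repeats it along the temporal
    axis, so the model cannot exploit temporal redundancy to copy masked
    content from adjacent frames.
    """
    num_masked = int(mask_ratio * num_spatial_patches)
    spatial_mask = np.zeros(num_spatial_patches, dtype=bool)
    masked_idx = np.random.choice(num_spatial_patches, num_masked, replace=False)
    spatial_mask[masked_idx] = True
    # Repeat the same spatial mask for every temporal cube (the "tube").
    return np.tile(spatial_mask, (num_temporal_cubes, 1))

# Example: 8 temporal cubes, 14x14 spatial patches, 90% masking.
mask = tube_mask(num_temporal_cubes=8, num_spatial_patches=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (8, 196), roughly 0.9 of tokens masked
```

In an MAE-style asymmetric design, only the unmasked (~10%) tokens would then be fed to the encoder, which is where the computational savings come from.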