18 Oct 2022 | Zhan Tong, Yibing Song, Jue Wang, Limin Wang
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
**Abstract:**
This paper introduces VideoMAE, a data-efficient self-supervised video pre-training method. Inspired by ImageMAE, VideoMAE employs a high masking ratio (90% to 95%) and a tube masking strategy to make video reconstruction a challenging task, encouraging the model to learn more effective video representations. Key findings include:
1. An extremely high masking ratio (90% to 95%) still yields favorable performance, leveraging temporal redundancy in videos.
2. VideoMAE achieves impressive results on small datasets (3k-4k videos) without additional data, highlighting the effectiveness of video reconstruction as a self-supervision task.
3. Data quality is more critical than quantity for self-supervised video pre-training, especially when there is a domain shift between the pre-training and target datasets.
**Introduction:**
Video transformers typically rely on large-scale supervised pre-training to perform well. VideoMAE removes this requirement with a simple pipeline: mask random spatiotemporal cubes and reconstruct the missing pixels, a design customized for the temporal redundancy and temporal correlation of video data. The very high masking ratio and the tube masking strategy prevent information leakage across frames and push the model toward learning high-level spatiotemporal structure rather than copying content from neighboring frames. A rough sketch of the reconstruction objective follows below.
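The sketch below illustrates the masked-reconstruction objective in the spirit of MAE-style training: a mean-squared error computed only over the masked cubes, so the model cannot score well by simply reproducing visible pixels. This is not the authors' code; the function name, shapes, and toy data are illustrative assumptions.

```python
import torch

def masked_reconstruction_loss(pred, target, mask):
    """MSE computed only on masked cubes (hypothetical helper).
    pred, target: (B, N, D) per-token pixel values; mask: (B, N), True = masked.
    Only hidden tokens contribute to the loss, as in MAE-style training."""
    per_token = ((pred - target) ** 2).mean(dim=-1)   # (B, N) per-token MSE
    mask = mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage with assumed shapes: 1568 tokens, 1536-dim pixel targets (2*16*16*3).
pred = torch.randn(2, 1568, 1536)
target = torch.randn(2, 1568, 1536)
mask = torch.rand(2, 1568) < 0.9                      # ~90% of tokens masked
loss = masked_reconstruction_loss(pred, target, mask)
```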
**Related Work:**
The paper reviews existing approaches to video representation learning, including supervised, semi-supervised, and contrastive methods, and discusses masked visual modeling, emphasizing the role of temporal information in video data.
**VideoMAE Design:**
VideoMAE uses strided temporal sampling to reduce the number of input frames, joint space-time cube embedding (cubes of size 2×16×16) to shrink the spatial and temporal dimensions of the token sequence, and an extremely high masking ratio to make reconstruction hard enough to serve as a useful pretext task. Tube masking samples one spatial mask and extends it across the entire temporal axis, so the temporal neighbors of a masked cube are masked as well, preventing information leakage from adjacent frames; see the sketch below.
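A minimal sketch of these two design pieces, assuming a 16-frame 224×224 clip with 2×16×16 cube embedding (yielding an 8×14×14 token grid) and a 90% masking ratio; the function and variable names are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Cube embedding: a 3D convolution with kernel = stride = (2, 16, 16) turns a
# 16-frame 224x224 clip into an 8 x 14 x 14 grid of token embeddings.
cube_embed = nn.Conv3d(in_channels=3, out_channels=768,
                       kernel_size=(2, 16, 16), stride=(2, 16, 16))

def tube_mask(batch_size, t_tokens=8, h_tokens=14, w_tokens=14, mask_ratio=0.9):
    """Sample one random spatial mask per clip and repeat it along time,
    so every temporal slice hides the same spatial positions ("tubes")."""
    num_spatial = h_tokens * w_tokens
    num_masked = int(mask_ratio * num_spatial)
    masks = []
    for _ in range(batch_size):
        ids = torch.randperm(num_spatial)
        spatial = torch.zeros(num_spatial, dtype=torch.bool)
        spatial[ids[:num_masked]] = True            # True = masked
        masks.append(spatial.repeat(t_tokens))      # same mask in every temporal slice
    return torch.stack(masks)                       # (B, t_tokens * h_tokens * w_tokens)

clip = torch.randn(2, 3, 16, 224, 224)                # (B, C, T, H, W)
tokens = cube_embed(clip).flatten(2).transpose(1, 2)  # (B, 8*14*14, 768)
mask = tube_mask(batch_size=2)                        # (B, 8*14*14)
visible = tokens[~mask].reshape(2, -1, 768)           # only ~10% of tokens reach the encoder
```

Because the mask is constant along time, a patch that is hidden in one frame stays hidden in all frames, which is what makes reconstruction rely on spatiotemporal reasoning rather than copying from a neighboring frame.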
**Experiments:**
VideoMAE is evaluated on five datasets: Kinetics-400, Something-Something V2, UCF101, HMDB51, and AVA. Ablation studies show the effectiveness of the proposed designs. VideoMAE outperforms other methods, including training from scratch and contrastive learning, on both large and small datasets. It demonstrates superior transferability and generalization capabilities.
**Conclusion:**
VideoMAE is a simple and data-efficient method for self-supervised video pre-training, leveraging high masking ratios and tube masking to learn effective video representations. It shows significant practical value, especially in scenarios with limited data.