April 15, 2024 | Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas
This paper explores feature prediction as a standalone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
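As a rough illustration of the core idea, the sketch below shows a V-JEPA-style feature-prediction step: a context encoder sees only the unmasked portion of a clip, a predictor regresses the features of the masked regions, and the regression targets come from a stop-gradient EMA copy of the encoder. The module sizes, the random masking used here in place of the paper's multi-block masking, and the helper names (TinyEncoder, TinyPredictor, vjepa_step) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a V-JEPA-style feature-prediction objective (assumptions noted above).
import copy
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the ViT video encoder: maps patch embeddings to features."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

class TinyPredictor(nn.Module):
    """Stand-in for the narrow predictor: maps context features to predicted target features."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

def vjepa_step(tokens, mask, encoder, predictor, target_encoder):
    """One feature-prediction step.

    tokens: (B, N, D) spatio-temporal patch embeddings of a video clip
    mask:   (B, N) boolean, True where the region is masked (to be predicted)
    """
    # Targets come from a stop-gradient EMA copy of the encoder on the full clip.
    with torch.no_grad():
        targets = target_encoder(tokens)

    # The context encoder sees only the unmasked tokens; zeroing masked
    # positions is a simple stand-in for dropping them.
    context = encoder(tokens * (~mask).unsqueeze(-1))

    # The predictor regresses features of the masked regions; L1 loss over masked positions.
    preds = predictor(context)
    return (preds - targets).abs().mean(dim=-1)[mask].mean()

# Toy usage
dim = 64
encoder, predictor = TinyEncoder(dim), TinyPredictor(dim)
target_encoder = copy.deepcopy(encoder)   # EMA target encoder; its update is omitted here
for p in target_encoder.parameters():
    p.requires_grad_(False)

tokens = torch.randn(2, 16, dim)          # 2 clips, 16 tokens each
mask = torch.rand(2, 16) < 0.5            # random mask as a placeholder for multi-block masking
loss = vjepa_step(tokens, mask, encoder, predictor, target_encoder)
loss.backward()
```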
The paper revisits feature prediction as a standalone objective for unsupervised learning of visual representations from video, asking how effective it can be when combined with modern tools. It introduces V-JEPA, a video joint-embedding predictive architecture trained solely with a feature prediction objective, without pretrained image encoders, text, negative examples, human annotations, or pixel-level reconstruction. The models are trained on 2 million videos collected from public datasets and evaluated on downstream image and video tasks. The results show that feature prediction is an effective standalone objective for unsupervised learning from video: models trained this way outperform pixel prediction approaches under a frozen evaluation protocol and remain competitive under full fine-tuning, while using significantly shorter training schedules. V-JEPA models are also more label-efficient than pixel prediction approaches, retaining strong downstream performance when only a few labeled examples are available. Without any adaptation of the model parameters, they solve a range of downstream image and video tasks and outperform previous video representation learning approaches in frozen evaluation on action recognition, spatio-temporal action detection, and image classification. Finally, pretraining V-JEPA on video proves particularly effective for downstream tasks requiring fine-grained motion understanding, where large-scale image models trained on internet-scale datasets fall short.
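The frozen evaluation referred to above keeps the pretrained encoder fixed and trains only a lightweight probe on its output features. The sketch below assumes a simple single-query attentive probe (a cross-attention pooling layer plus a linear head) to illustrate the idea; the exact probe architecture used in the paper may differ.

```python
# Hedged sketch of frozen-backbone evaluation with an attentive probe.
# The probe design is an illustrative assumption, not the paper's exact head.
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))          # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, features):                                   # features: (B, N, D) from the frozen encoder
        q = self.query.expand(features.size(0), -1, -1)
        pooled, _ = self.attn(q, features, features)                # attention-pool the token features
        return self.head(pooled.squeeze(1))                        # per-clip class logits

# Toy usage: the encoder stays frozen, only the probe is trained.
dim, num_classes = 64, 400                  # e.g. 400 classes for Kinetics-400
features = torch.randn(2, 16, dim)          # pretend these came from the frozen encoder
probe = AttentiveProbe(dim, num_classes)
logits = probe(features)                    # (2, 400)
loss = nn.functional.cross_entropy(logits, torch.tensor([3, 7]))
loss.backward()
```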