Revisiting Feature Prediction for Learning Visual Representations from Video

April 15, 2024 | Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas
This paper explores the effectiveness of feature prediction as a standalone objective for unsupervised learning from video. The authors introduce V-JEPA, a collection of vision models trained solely with a feature prediction objective, without pretrained image encoders, text, negative examples, pixel reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and evaluated on downstream image and video tasks. The results show that learning by predicting video features yields versatile visual representations that perform well on both motion- and appearance-based tasks, without adapting the model's parameters. Specifically, the largest model, a ViT-H/16 trained only on videos, achieves 81.9% accuracy on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K. The paper also discusses the theoretical motivation behind the feature prediction objective and the impact of various design choices, and compares V-JEPA with other state-of-the-art methods, demonstrating superior performance and label efficiency.
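To make the feature prediction objective concrete, below is a minimal sketch of a JEPA-style training loss in PyTorch: visible video tokens are encoded, a predictor regresses the features of the masked tokens, and the targets come from an exponential-moving-average copy of the encoder under stop-gradient. The module names, the encoder/predictor interfaces, the masking indices, and the momentum value are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a JEPA-style feature-prediction objective (assumed PyTorch setup).
# `encoder`, `predictor`, and their call signatures are hypothetical placeholders.
import copy
import torch
import torch.nn.functional as F
from torch import nn


class FeaturePredictionSketch(nn.Module):
    def __init__(self, encoder: nn.Module, predictor: nn.Module, ema_momentum: float = 0.998):
        super().__init__()
        self.encoder = encoder                        # processes only the visible (unmasked) tokens
        self.predictor = predictor                    # predicts features at the masked locations
        self.target_encoder = copy.deepcopy(encoder)  # EMA copy, never updated by gradients
        for p in self.target_encoder.parameters():
            p.requires_grad = False
        self.ema_momentum = ema_momentum              # assumed value, not taken from the paper

    @torch.no_grad()
    def update_target(self):
        # Exponential moving average of the online encoder's weights.
        for p_t, p_o in zip(self.target_encoder.parameters(), self.encoder.parameters()):
            p_t.mul_(self.ema_momentum).add_(p_o.detach(), alpha=1 - self.ema_momentum)

    def forward(self, video_tokens, visible_idx, masked_idx):
        # video_tokens: (B, N, D) patchified video clip; visible_idx / masked_idx index the token axis.
        context = self.encoder(video_tokens[:, visible_idx])            # encode visible tokens only
        predicted = self.predictor(context, masked_idx)                 # predict features of masked tokens
        with torch.no_grad():
            targets = self.target_encoder(video_tokens)[:, masked_idx]  # stop-gradient feature targets
        # Regression in feature space: no pixel reconstruction and no negative examples.
        return F.l1_loss(predicted, targets)
```

The key design choice reflected here is that the loss is computed between predicted and target features rather than pixels, which is what lets the method avoid reconstruction targets and negative pairs altogether.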