1 Mar 2024 | Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, Yann LeCun
This paper introduces Image World Models (IWM), a method for learning self-supervised visual representations by learning a world model in latent space. IWM extends the Joint-Embedding Predictive Architecture (JEPA) by incorporating photometric transformations, so that the predictor learns to model the effect of global transformations on an image's representation rather than only filling in masked content. The authors identify three key factors for learning a capable world model: the complexity of the transformations, the conditioning of the predictor on those transformations, and the capacity of the predictor.

Because the world model predicts transformations in latent space, it also gives control over the level of abstraction of the learned representations: a higher-capacity world model yields less abstract, more equivariant representations (closer to masked image modeling), while a weaker one yields more invariant, abstract representations (closer to contrastive methods), letting IWM interpolate between the two families. The learned model can be adapted through fine-tuning to solve diverse downstream tasks such as image classification and segmentation, often outperforming previous self-supervised methods. Overall, the results indicate that IWM is a versatile framework for visual representation learning, and the paper concludes that learning and leveraging world models is a promising direction for improving performance across a wide range of visual tasks.
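To make the training recipe concrete, below is a minimal PyTorch sketch of a JEPA-style step with a transformation-conditioned predictor, in the spirit of IWM. This is not the authors' implementation: the names (`IWMPredictor`, `iwm_training_step`), the simple MLP predictor, the choice of smooth L1 loss, and the toy encoders are illustrative assumptions; the actual method uses ViT encoders, masking, and specific augmentation encodings described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IWMPredictor(nn.Module):
    """Hypothetical predictor conditioned on transformation parameters.

    Given latents of the augmented view and an encoding of the applied
    photometric transformation, it predicts the latents of the clean view,
    i.e. it models the transformation in representation space.
    """

    def __init__(self, dim: int, cond_dim: int, hidden: int = 1024, depth: int = 2):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, dim)
        layers, in_dim = [], dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.GELU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, dim))
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Inject the transformation parameters so the predictor knows which
        # transformation it must account for.
        return self.net(z + self.cond_proj(cond))


def iwm_training_step(encoder, target_encoder, predictor, x_clean, x_aug, aug_params):
    """One simplified IWM-style step (assumed layout, not the paper's exact loss)."""
    with torch.no_grad():
        target = target_encoder(x_clean)   # target latents from the (EMA) target encoder
    z = encoder(x_aug)                     # online latents of the transformed view
    pred = predictor(z, aug_params)        # predict the clean view's latents, given the transform
    return F.smooth_l1_loss(pred, target)  # regression loss in representation space


if __name__ == "__main__":
    # Toy usage with dummy encoders and random data, just to show the shapes.
    dim, cond_dim = 256, 8
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
    target_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
    predictor = IWMPredictor(dim, cond_dim)
    x_clean = torch.randn(4, 3, 32, 32)
    x_aug = torch.randn(4, 3, 32, 32)          # stand-in for a photometrically augmented view
    aug_params = torch.randn(4, cond_dim)      # stand-in encoding of the applied transformation
    print(iwm_training_step(encoder, target_encoder, predictor, x_clean, x_aug, aug_params))
```

In JEPA-style methods the target encoder is typically an exponential moving average of the online encoder; that update, as well as masking of the source view, is omitted here for brevity.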