1 Mar 2024 | Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, Yann LeCun
The paper introduces Image World Models (IWM), an approach to visual representation learning that leverages world models. IWM builds on the Joint-Embedding Predictive Architecture (JEPA) and extends it to learn and predict the effects of global photometric transformations in latent space. The authors identify three key aspects of learning a performant IWM: conditioning, prediction difficulty, and capacity. They demonstrate that IWM can be adapted through fine-tuning to solve diverse tasks, outperforming previous self-supervised methods, and show that the learned world model allows controlling the abstraction level of the representations, enabling both invariant and equivariant representations.
The paper provides guidelines for learning a good image world model and highlights the efficiency and versatility of predictor fine-tuning compared to encoder fine-tuning. It also explores the trade-off between representation quality and adaptability, placing different families of methods on a spectrum of representation abstraction. The results suggest that learning image world models is a promising framework for visual representation learning.
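The core objective described above — a predictor that takes the embedding of a clean view, conditioned on the parameters of a photometric transformation, and predicts the embedding of the transformed view — can be sketched numerically. This is a minimal toy illustration, not the paper's implementation: the linear "networks", the dimensions, and the synthetic "transformed" input are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not from the paper.
D_EMB, D_ACT = 16, 4  # embedding size, transformation-parameter size

# Toy linear maps standing in for the encoder and predictor networks.
W_enc = rng.normal(scale=0.1, size=(D_EMB, D_EMB))
W_pred = rng.normal(scale=0.1, size=(D_EMB, D_EMB + D_ACT))

def encode(x):
    """Stand-in encoder: maps an input vector to a latent embedding."""
    return W_enc @ x

def predict(z_src, action):
    """Predictor conditioned on the transformation parameters ("action")."""
    return W_pred @ np.concatenate([z_src, action])

# A toy input and the parameters of a photometric transformation
# (e.g. a brightness/contrast shift); both are illustrative values.
x = rng.normal(size=D_EMB)
action = rng.normal(size=D_ACT)
x_aug = x + 0.1 * action[0]  # placeholder "transformed" input

# IWM-style objective: predict the embedding of the transformed view
# from the embedding of the clean view plus the action, in latent space.
z_src, z_tgt = encode(x), encode(x_aug)
loss = float(np.mean((predict(z_src, action) - z_tgt) ** 2))
print(f"latent prediction loss: {loss:.4f}")
```

The conditioning on `action` is what distinguishes this from an unconditional JEPA predictor: because the predictor knows which transformation was applied, the encoder is not forced to be invariant to it, which is how the summary's invariant-vs-equivariant trade-off arises.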