18 Dec 2020 | Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola
Contrastive learning using multiple views has achieved state-of-the-art performance in self-supervised representation learning, but the choice of views strongly affects the quality of the learned representations. This paper argues that optimal views should minimize the mutual information (MI) between views while preserving task-relevant information. The authors support this claim with theoretical and empirical analysis, showing that reducing excess MI improves downstream classification accuracy. They also introduce a semi-supervised method for learning effective views and show that stronger data augmentation reduces MI between views and improves performance.

The paper further examines the relationship between MI and representation quality, identifying a "sweet spot" where MI is neither too high nor too low: views should retain only the information needed for the downstream task and discard nuisance information. This is formalized as the "InfoMin principle," which complements the "InfoMax principle" by emphasizing task-relevant rather than total shared information.

Views constructed according to this principle lead to better performance on a range of downstream tasks, including object detection and segmentation. With a ResNet-50 backbone, the resulting approach achieves a new state-of-the-art 73.0% top-1 accuracy on ImageNet linear classification. The paper concludes that the choice of views is crucial for effective contrastive learning and that the InfoMin principle provides a useful framework for designing good views.
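As a rough formalization of the InfoMin criterion (paraphrasing the paper's setup; the notation below is informal shorthand): given an input x with downstream label y, good views v1 and v2 share as little information as possible while each remains sufficient for the task.

```latex
% Sketch of the InfoMin criterion for optimal views (informal notation).
% v_1, v_2 are views of input x; y is the downstream label; I(\cdot;\cdot) is mutual information.
\begin{equation*}
  (v_1^{\ast}, v_2^{\ast})
  \;=\; \operatorname*{arg\,min}_{v_1, v_2} \; I(v_1; v_2)
  \quad \text{subject to} \quad
  I(v_1; y) \;=\; I(v_2; y) \;=\; I(x; y).
\end{equation*}
```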
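For concreteness, below is a minimal sketch of the kind of two-view contrastive (InfoNCE) objective this line of work builds on. The function and variable names are illustrative assumptions, not the authors' released implementation; the views z1 and z2 are assumed to be encoder embeddings of two independently augmented crops of the same image batch.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Two-view InfoNCE loss: each embedding must identify its counterpart
    from the other view among all other embeddings in the batch.
    Illustrative sketch only, not the paper's exact code."""
    z1 = F.normalize(z1, dim=1)          # (N, D) embeddings of view 1
    z2 = F.normalize(z2, dim=1)          # (N, D) embeddings of view 2
    logits = z1 @ z2.t() / temperature   # (N, N) pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives lie on the diagonal
    # Symmetrize: view-1 -> view-2 retrieval and view-2 -> view-1 retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this setting, the two views come from two independently sampled augmentations of the same image; under the InfoMin view, making those augmentations stronger (more aggressive cropping, color jittering, and so on) lowers the MI between the views and, up to the sweet spot, improves the learned representation.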