Do Generated Data Always Help Contrastive Learning?
Contrastive learning (CL) has become a standard method for unsupervised visual representation learning, but it typically relies on intensive, hand-crafted data augmentations. With the rise of generative models, especially diffusion models, high-quality generated images are increasingly used to enhance CL through "data inflation." However, this study shows that generated data, even from a strong model such as DDPM, can sometimes harm CL. The paper investigates the causes of this failure from both the data inflation and the data augmentation perspectives, and reveals a complementary relationship between the two: stronger data inflation should be paired with weaker augmentations, and vice versa. Theoretical explanations are provided for these phenomena, and an adaptive strategy called AdaInf is proposed, which adjusts the augmentation strength and the real-to-generated mixing ratio at no extra computation cost. AdaInf significantly improves CL performance on benchmark datasets, achieving 94.70% linear-probe accuracy on CIFAR-10 with SimCLR. Overall, the theoretical analysis and experiments demonstrate that generated data help only when properly balanced with data augmentation, and that adaptive strategies like AdaInf can effectively exploit this complementarity.
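As a rough illustration of the AdaInf idea, the sketch below mixes real and generated images with a tunable real:generated ratio and applies a deliberately weakened augmentation pipeline before standard SimCLR training. The concrete values here (the 10:1 mixing ratio, the softened crop scale and color jitter, the `./ddpm_samples` directory of generated images) are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of AdaInf-style data inflation for SimCLR pretraining.
# Assumptions (not from the paper): a 10:1 real:generated mixing ratio,
# a weakened augmentation pipeline, and DDPM samples stored as image
# files under ./ddpm_samples (loadable with ImageFolder).
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler
from torchvision import datasets, transforms

# Weaker augmentations than standard SimCLR: milder crops and color jitter,
# following the finding that inflated data calls for weaker augmentation.
weak_aug = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.5, 1.0)),  # SimCLR default is (0.08, 1.0)
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.2, 0.2, 0.2, 0.05)], p=0.8),
    transforms.ToTensor(),
])

class TwoCrops:
    """Return two independently augmented views of one image (SimCLR input)."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, img):
        return self.transform(img), self.transform(img)

real = datasets.CIFAR10(root="./data", train=True, download=True,
                        transform=TwoCrops(weak_aug))
generated = datasets.ImageFolder(root="./ddpm_samples",  # hypothetical path
                                 transform=TwoCrops(weak_aug))

# Mixing ratio: upweight real samples so each batch is roughly 10:1
# real:generated. Only the sampler changes, so there is no extra compute.
ratio = 10.0  # illustrative value
weights = torch.cat([torch.full((len(real),), ratio),
                     torch.full((len(generated),), 1.0)])
sampler = WeightedRandomSampler(weights, num_samples=len(real) + len(generated))

loader = DataLoader(ConcatDataset([real, generated]), batch_size=512,
                    sampler=sampler, num_workers=4, drop_last=True)

# `loader` now yields (view1, view2) pairs; the encoder and contrastive
# loss are unchanged from an ordinary SimCLR setup.
```

Because only the sampler weights and the transform strengths change, such a strategy is drop-in for an existing SimCLR pipeline, consistent with the paper's claim that AdaInf adds no extra computation cost.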