Learning Transferable Visual Models From Natural Language Supervision

26 Feb 2021 | Alec Radford*, Jong Wook Kim*, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (*equal contribution)
The paper introduces CLIP (Contrastive Language-Image Pre-training), a method for learning visual representations from natural language supervision. CLIP is trained on a large dataset of 400 million (image, text) pairs collected from the internet, using a contrastive learning objective to predict the correct pairings of images and their associated captions. This approach leverages the broad source of supervision available in web-scale text data, which is more scalable and expressive than traditional crowd-sourced labeling. The pre-trained model is then used to perform zero-shot transfer to various downstream tasks, such as OCR, action recognition, and fine-grained object classification, achieving competitive performance with fully supervised baselines without requiring additional dataset-specific training. The paper also discusses the scalability of CLIP, demonstrating that its transfer performance scales smoothly with computational resources. Additionally, the authors analyze the representation learning capabilities of CLIP, showing that it outperforms existing models in zero-shot and few-shot learning tasks. The results highlight the potential of using natural language supervision for computer vision tasks, suggesting that CLIP could be a powerful tool for developing task-agnostic and efficient visual models.
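As a rough illustration of the contrastive objective summarized above, the sketch below follows the structure of the pseudocode given in the paper: a symmetric cross-entropy loss over the cosine similarities between the image and text embeddings of a batch, where the matching pair for each image is its own caption. The function name, the fixed temperature value, and the assumption that encoder outputs arrive as equal-sized batches are illustrative choices, not taken from the released code (in the paper the temperature is a learned parameter).

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of (image, text) embedding pairs."""
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits between every image and every text in the batch.
    logits = image_features @ text_features.t() / temperature

    # The correct caption for image i sits at column i (and vice versa).
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

At inference time, the paper performs zero-shot classification by embedding the class names of a target dataset with prompt templates such as "a photo of a {label}" and predicting the class whose text embedding is most similar to the image embedding.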