Learning Transferable Visual Models From Natural Language Supervision


26 Feb 2021 | Alec Radford*, Jong Wook Kim*, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (OpenAI; * equal contribution)
This paper introduces CLIP (Contrastive Language-Image Pre-training), a model that learns image representations from natural language supervision. Rather than relying on traditional crowd-labeled datasets, CLIP is trained on 400 million (image, text) pairs with a simple pre-training task: predicting which caption goes with which image.

Concretely, given a batch of (image, text) pairs, CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity between the embeddings of the real pairs while minimizing it for the incorrect pairings within the batch.
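This symmetric contrastive objective can be illustrated with a minimal PyTorch-style sketch (the paper itself provides Numpy-like pseudocode for the same idea). The function name, the assumption that the encoder outputs are already computed, and the learned temperature parameter `logit_scale` are illustrative stand-ins, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    image_features, text_features: [N, d] outputs of the image/text encoders.
    logit_scale: learned temperature (scalar tensor), exponentiated below.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise scaled cosine similarities for the whole batch: [N, N].
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The i-th image matches the i-th text, so the targets are the diagonal.
    targets = torch.arange(image_features.shape[0], device=image_features.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2
```

Averaging the two cross-entropy terms is what makes the objective symmetric: every image must pick out its caption among the batch's texts, and every caption must pick out its image among the batch's images.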
The authors argue that natural language is a promising alternative source of supervision for learning image representations: it is far broader and easier to scale than crowd-labeled datasets and provides a more general form of supervision, although it requires care in how the supervision is posed and broad evaluation to confirm that the learned representations are useful, robust, and generalizable.

After pre-training, natural language is used to reference or describe visual concepts, enabling zero-shot transfer to downstream tasks. The model is evaluated on over 30 existing computer vision datasets, spanning tasks such as OCR, action recognition, and geo-localization, and is often competitive with fully supervised baselines without any dataset-specific training. For example, zero-shot CLIP matches the accuracy of the original ResNet-50 on ImageNet without using any of the 1.28 million training examples that model was trained on. CLIP is also more robust than supervised models of equivalent accuracy, suggesting that zero-shot evaluation of task-agnostic models is more representative of a model's capability, and it is more efficient than prior task-specific supervised approaches. The authors conclude that CLIP is a significant step toward flexible and practical zero-shot computer vision classifiers, and they release code and pre-trained weights at https://github.com/OpenAI/CLIP.
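Zero-shot classification with the released model can be sketched as follows, using the open-source `clip` package from the repository above. The class names, prompt template, and image path are illustrative placeholders rather than anything prescribed by the paper:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative class names; in practice these come from the target dataset.
class_names = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {name}" for name in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the image and each prompt embedding.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

prediction = class_names[similarity.argmax().item()]
print(f"Predicted class: {prediction}")
```

The classifier is built entirely from text: swapping in a different set of class names or a different prompt template yields a new classifier with no additional training.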