Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts


30 Mar 2021 | Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut
The paper introduces Conceptual 12M (CC12M), a vision-and-language (V+L) pre-training dataset of 12 million image-text pairs. CC12M is derived from the Conceptual Captions 3M (CC3M) pipeline by relaxing its data-collection filters, yielding a larger, noisier, and more diverse dataset that covers a wider range of visual concepts, including rare, long-tail ones. The authors analyze CC12M and compare its effectiveness against CC3M on downstream tasks, with a particular focus on long-tail visual recognition. Their results show that scaling up the pre-training data substantially improves performance on image captioning, novel object captioning, image retrieval, and visual-linguistic matching, with CC12M outperforming CC3M across these benchmarks. The paper concludes that large-scale, noisy Web-scale image-text pairs are a promising source of pre-training data for V+L research.
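To make the pipeline relaxation concrete, here is a minimal Python sketch of how loosening filter thresholds in an image-text collection pipeline admits more (noisier) pairs. The `Candidate` class, the `passes` helper, and all threshold values are illustrative assumptions for this sketch, not the exact filters or values used to build CC3M or CC12M.

```python
# Illustrative sketch (not the authors' code) of how relaxing filter
# thresholds grows a Web-scale image-text dataset. All thresholds
# below are hypothetical, not the actual CC3M/CC12M values.

from dataclasses import dataclass

@dataclass
class Candidate:
    width: int       # image width in pixels
    height: int      # image height in pixels
    caption: str     # alt-text paired with the image

def passes(c: Candidate, min_side: int, max_tokens: int) -> bool:
    """Keep a pair only if the image is large enough and the caption
    is within the token budget; looser thresholds admit more pairs."""
    tokens = c.caption.split()
    return min(c.width, c.height) >= min_side and len(tokens) <= max_tokens

strict = dict(min_side=400, max_tokens=20)    # CC3M-like: favors precision
relaxed = dict(min_side=300, max_tokens=256)  # CC12M-like: favors recall

pool = [
    Candidate(640, 480, "a dog catching a frisbee in the park"),
    Candidate(320, 320, "a pangolin crossing a dirt road at dusk"),
]

print([passes(c, **strict) for c in pool])   # [True, False] - fewer pairs survive
print([passes(c, **relaxed) for c in pool])  # [True, True]  - long-tail pair kept
```

The second candidate illustrates the trade-off the paper describes: a rare, long-tail concept that a precision-oriented filter would discard is retained under the relaxed settings, at the cost of admitting noisier pairs overall.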