30 Mar 2021 | Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut
The paper introduces Conceptual 12M (CC12M), a large-scale vision-and-language pre-training dataset that extends Conceptual Captions 3M (CC3M). CC12M consists of 12.4 million image-text pairs, roughly four times the size of CC3M, and is designed to address the limited scale and diversity of existing pre-training datasets. The authors relax the filtering pipeline used to build CC3M in order to admit more diverse and long-tail visual concepts, yielding a dataset with greater concept diversity and a longer-tailed distribution of concepts. The paper evaluates CC12M on two main V+L tasks: vision-to-language generation and vision-and-language matching, with a focus on long-tail recognition and out-of-distribution generalization. The results show that pre-training on CC12M outperforms pre-training on CC3M on both tasks, achieving state-of-the-art performance on the nocaps benchmark.
The paper also discusses the broader impact of the dataset, highlighting its potential to improve the robustness of models in real-world scenarios and the need for careful handling of potential biases and unsuitable content.