31 Jul 2024 | Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt
MINT-1T is a large-scale, open-source multimodal interleaved dataset containing one trillion text tokens and 3.4 billion images, a tenfold increase over existing open-source datasets. Beyond HTML, it draws on PDF and ArXiv sources, broadening the diversity of its documents. Curating the dataset required extensive engineering across data sourcing, filtering, deduplication, and safety checks. Multimodal models trained on MINT-1T outperform those trained on previous datasets such as OBELICS on a range of benchmarks, while benefiting from a more diverse mix of data sources. The dataset is released with detailed documentation and analysis to benefit the research community, and the work highlights the importance of large-scale, open-source multimodal datasets in advancing research and development in multimodal learning.
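To make the curation step concrete, below is a minimal sketch of one common technique in this kind of pipeline: exact-hash deduplication of text paragraphs inside interleaved documents. The `Segment`/`Document` representation, the normalization scheme, and the function names are illustrative assumptions for this sketch, not the actual MINT-1T pipeline described in the paper.

```python
# Illustrative sketch: paragraph-level exact-hash deduplication for
# interleaved text/image documents. All types and names here are
# hypothetical, not the MINT-1T implementation.
import hashlib
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str      # "text" or "image"
    payload: str   # raw paragraph text, or an image URL


@dataclass
class Document:
    segments: list[Segment]


def _norm_hash(text: str) -> str:
    # Normalize whitespace and case before hashing so trivially
    # reformatted copies of a paragraph collapse to the same key.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_paragraphs(docs: list[Document]) -> list[Document]:
    # Drop text segments whose normalized hash was already seen in
    # an earlier document; image segments pass through untouched.
    seen: set[str] = set()
    out: list[Document] = []
    for doc in docs:
        kept = []
        for seg in doc.segments:
            if seg.kind == "text":
                h = _norm_hash(seg.payload)
                if h in seen:
                    continue  # repeated boilerplate paragraph
                seen.add(h)
            kept.append(seg)
        if kept:
            out.append(Document(kept))
    return out


if __name__ == "__main__":
    docs = [
        Document([Segment("text", "All rights reserved."),
                  Segment("image", "https://example.com/fig1.png"),
                  Segment("text", "A unique paragraph.")]),
        Document([Segment("text", "All rights reserved."),  # duplicate
                  Segment("text", "Another unique paragraph.")]),
    ]
    for d in dedup_paragraphs(docs):
        print([(s.kind, s.payload[:30]) for s in d.segments])
```

Production pipelines at this scale typically use fuzzy methods such as MinHash over Bloom-filter-backed stores rather than an in-memory set, but the exact-hash version above captures the core idea: normalize, fingerprint, and drop repeats while preserving the interleaved document structure.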