31 Jul 2024 | Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt
MINT-1T is a groundbreaking open-source multimodal dataset that scales up existing open-source datasets by 10x, comprising one trillion text tokens and 3.4 billion images. It addresses the scarcity of large-scale, open-source multimodal interleaved data, which is crucial for training advanced large multimodal models (LMMs). MINT-1T draws on diverse sources, including HTML, PDFs, and ArXiv papers, making it more extensive and diverse than previous datasets. The authors detail the data engineering process, which involves handling large document sizes, preserving the original ordering of images and text, and filtering out low-quality and inappropriate content. Experiments show that LMMs trained on MINT-1T perform on par with or surpass those trained on the previous leading dataset, OBELICS, across a range of benchmarks. The dataset is available at https://github.com/mlfoundations/MINT-1T, and the authors discuss its potential to help bridge the gap between open- and closed-source models in multimodal research.
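The summary only gestures at what an "interleaved" document is and how filtering might work. As a rough illustration (not the authors' actual pipeline), the sketch below shows one way to represent a document as an ordered sequence of text and image items, preserving the original reading order, and to apply simple size-based filters. All names and thresholds here (InterleavedDoc, min_text_tokens, max_images) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

# Hypothetical representation of one interleaved document: an ordered
# sequence of text and image items, keeping the reading order from the
# source HTML, PDF, or ArXiv paper.

@dataclass
class Item:
    kind: Literal["text", "image"]
    text: Optional[str] = None       # populated when kind == "text"
    image_url: Optional[str] = None  # populated when kind == "image"

@dataclass
class InterleavedDoc:
    source: str                      # e.g. "html", "pdf", "arxiv"
    items: List[Item] = field(default_factory=list)

def passes_basic_filters(
    doc: InterleavedDoc,
    min_text_tokens: int = 50,       # hypothetical threshold
    max_images: int = 30,            # hypothetical threshold
) -> bool:
    """Return True if the document clears simple size checks.

    Illustrative only: the real MINT-1T pipeline also includes
    deduplication, NSFW image filtering, and other quality filters
    not shown here.
    """
    n_tokens = sum(
        len(it.text.split())
        for it in doc.items
        if it.kind == "text" and it.text
    )
    n_images = sum(1 for it in doc.items if it.kind == "image")
    return n_tokens >= min_text_tokens and 0 < n_images <= max_images

# Usage example
doc = InterleavedDoc(
    source="html",
    items=[
        Item(kind="text", text="A caption-like paragraph " * 20),
        Item(kind="image", image_url="https://example.com/fig1.png"),
        Item(kind="text", text="Follow-up discussion of the figure."),
    ],
)
print(passes_basic_filters(doc))  # True
```

Keeping text and images in a single ordered list, rather than in separate fields, is what preserves the interleaving that the paper emphasizes; the thresholds above are placeholders and not the values used for MINT-1T.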