24 Nov 2024 | Xiaotong Li, Fan Zhang, Haiwen Diao, Yuee Wang, Xinlong Wang, Ling-Yu Duan
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
This paper introduces DenseFusion-1M, a large-scale image-text dataset designed to enhance the perception and cognition abilities of Multimodal Large Language Models (MLLMs). The dataset is created by integrating diverse vision experts as image priors and using a low-budget MLLM as a central pivot for information fusion. To build it, 1 million highly representative images are carefully selected from the LAION-2B dataset, and dense descriptions are generated for them with a purpose-built caption engine. The resulting dataset significantly improves the perception and cognition abilities of existing MLLMs across diverse vision-language benchmarks, especially when high-resolution images are used as inputs.
The dataset is constructed through a perceptual fusion pipeline that leverages multi-source experts as image priors, establishing a low-budget yet powerful caption engine that comprehends image elements and generates well-crafted descriptions. The pipeline comprises three parts: data processing, perceptual fusion, and construction of the caption engine. The data processing stage selects high-resolution images and applies semantic clustering and deduplication. The perceptual fusion stage integrates various visual experts to provide explicit information about visual elements, and adopts an efficient MLLM as a central pivot to mimic the perception abilities of advanced MLLMs.
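The semantic clustering and deduplication step in the data processing stage can be sketched as a greedy filter over image embeddings. The paper does not publish this exact procedure; the snippet below is a minimal illustration of the idea, assuming cosine similarity over precomputed embeddings (the function name and threshold are hypothetical):

```python
import numpy as np

def deduplicate_by_similarity(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedy semantic deduplication: keep an image only if its embedding's
    cosine similarity to every previously kept image stays below `threshold`.
    The threshold value here is illustrative, not from the paper."""
    # Normalise rows so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if all(float(vec @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: the second row nearly duplicates the first and is dropped.
emb = np.array([
    [1.0, 0.0],
    [0.99, 0.01],
    [0.0, 1.0],
])
print(deduplicate_by_similarity(emb))  # → [0, 2]
```

In practice such a filter would run on CLIP-style embeddings after cluster assignment, so that near-duplicates are compared only within their cluster rather than against the full corpus.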
The caption engine is based on LLaVA-1.6 (7B) and takes high-resolution images as inputs to ensure better visibility of detailed visual clues. The expertise of the visual specialists is extracted offline and supplied as contextual information to the caption engine. This allows the engine to capture diverse visual clues effectively, enhancing its perception abilities with insights from the vision experts. Consequently, it accurately identifies a wide range of objects and detailed textual information, yielding image annotations with high information density.
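Feeding offline-extracted expert outputs to the caption engine amounts to assembling them into the textual context that accompanies the image and the captioning instruction. The sketch below illustrates this assembly step only; the expert names, output formats, and instruction wording are assumptions for illustration, not the paper's actual prompt:

```python
def build_fusion_prompt(expert_outputs: dict[str, str], instruction: str) -> str:
    """Assemble vision-expert outputs (hypothetical names and formats)
    into a single context block that precedes the captioning instruction."""
    context_lines = [f"[{name}] {result}" for name, result in expert_outputs.items()]
    return "\n".join(context_lines) + "\n\n" + instruction

prompt = build_fusion_prompt(
    {
        "object detection": "person (0.98), bicycle (0.91), stop sign (0.87)",
        "OCR": '"MAIN ST" on the sign; "2024" on the jersey',
        "tagging": "street, cycling, daytime, urban",
    },
    "Describe the image in detail, grounding every claim in the context above.",
)
print(prompt)
```

Because the expert outputs are computed once offline, the per-image cost at caption time reduces to a single forward pass of the low-budget MLLM over the image plus this text context.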
The dataset is evaluated on various vision-language benchmarks, including ScienceQA, VQA-v2, GQA, MME, POPE, VQA-T, MMBench, SEED, and MM-Vet. The results show that it significantly improves the performance of existing MLLMs, especially in text-recognition scenes. Compared against state-of-the-art approaches, the MLLM trained on DenseFusion-1M demonstrates superior performance across 10 vision-language benchmarks, particularly for detailed text recognition and high-resolution image perception.
The dataset and code are publicly available at https://github.com/baaivision/DenseFusion.