24 Nov 2024 | Xiaotong Li, Fan Zhang*, Haiwen Diao*, Yueze Wang, Xinlong Wang†, Ling-Yu Duan†
The paper "DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception" addresses the challenge of limited high-quality image-text data by proposing a low-budget caption engine for generating hyper-detailed captions. The authors curate a dataset from the LAION-2B corpus and develop a perceptual fusion pipeline that integrates insights from various vision experts to produce one million well-rounded image descriptions, named DenseFusion-1M. This dataset is designed to enhance the perceptual abilities of Multimodal Large Language Models (MLLMs) by providing more effective alignment between visual and textual data. The paper includes extensive experiments validating the effectiveness of DenseFusion-1M across multiple vision-language benchmarks, demonstrating significant improvements in text recognition and high-resolution image perception. The dataset and code are publicly available to promote further research in multimodal understanding and reasoning.The paper "DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception" addresses the challenge of limited high-quality image-text data by proposing a low-budget caption engine for generating hyper-detailed captions. The authors curate a dataset from the LAION-2B corpus and develop a perceptual fusion pipeline that integrates insights from various vision experts to produce one million well-rounded image descriptions, named DenseFusion-1M. This dataset is designed to enhance the perceptual abilities of Multimodal Large Language Models (MLLMs) by providing more effective alignment between visual and textual data. The paper includes extensive experiments validating the effectiveness of DenseFusion-1M across multiple vision-language benchmarks, demonstrating significant improvements in text recognition and high-resolution image perception. The dataset and code are publicly available to promote further research in multimodal understanding and reasoning.