COSMO: Contrastive Streamlined Multimodal Model with Interleaved Pre-Training


1 Jan 2024 | Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
COSMO is a contrastive streamlined multimodal model with interleaved pre-training, designed to improve performance on both image-text and video-text tasks. The model introduces a novel architecture that partitions a large language model (LLM) into a dedicated text-processing component and a multimodal component, and adds a contrastive loss to strengthen the alignment between visual and textual representations. COSMO outperforms existing models such as OpenFlamingo while using only 34% of the learnable parameters, with notable gains on tasks like image captioning: performance on the 4-shot Flickr captioning task rises from 57.2% to 65.1%.
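To make the alignment objective described above more concrete, the sketch below shows a standard symmetric (CLIP-style) contrastive loss between pooled visual and text embeddings. It is a minimal illustration under assumed names, dimensions, and temperature, not COSMO's actual implementation.

```python
# Minimal sketch of an image-text contrastive alignment loss (InfoNCE-style).
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired visual/text embeddings.

    visual_emb, text_emb: (batch_size, dim) pooled features from the vision
    encoder and the text branch, respectively.
    """
    # L2-normalize so the dot product is cosine similarity.
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits; matching pairs lie on the diagonal.
    logits = v @ t.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    # Random features standing in for encoder outputs.
    vis = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_alignment_loss(vis, txt).item())
```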
To address the scarcity of high-quality long-text video datasets, the paper also introduces Howto-Interlink7M, a new interleaved video-text dataset derived from HowTo100M, using GPT-4 to generate detailed annotations. The dataset provides comprehensive, coherent captions, and training on it significantly improves model performance on both image-text and video-text tasks, underscoring the importance of high-quality interleaved data in multimodal learning.

The paper further analyzes different data sampling strategies and the impact of varying sequence lengths on model performance. It highlights the importance of balanced data sampling and the benefits of larger visual encoders, while emphasizing efficient model design to limit computational cost. The model is evaluated across 14 diverse downstream datasets spanning image-text and video-text tasks. Overall, the work contributes a new architecture that incorporates contrastive learning into interleaved pre-training, a new interleaved video-text dataset, and evidence that high-quality interleaved data improves model performance; it also discusses the limitations of current datasets and directions for further research.
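As a rough illustration of the balanced-sampling idea mentioned above, the following sketch draws batches according to fixed per-source weights rather than raw corpus sizes. The source names and ratios are hypothetical and chosen only for illustration; the paper's actual sampling schedule may differ.

```python
# Hedged sketch of balanced sampling across heterogeneous pre-training sources
# (e.g. image-text pairs, interleaved documents, video-text clips).
import random
from typing import Dict, Iterator, List

def balanced_batch_stream(sources: Dict[str, List[dict]],
                          weights: Dict[str, float],
                          batch_size: int = 8,
                          seed: int = 0) -> Iterator[List[dict]]:
    """Yield batches whose composition follows `weights`, not corpus sizes."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[n] for n in names]
    while True:
        batch = []
        for _ in range(batch_size):
            # Pick a source by weight, then a random example from that source.
            name = rng.choices(names, weights=probs, k=1)[0]
            batch.append(rng.choice(sources[name]))
        yield batch

if __name__ == "__main__":
    # Toy corpora of very different sizes; the sampling ratio stays fixed.
    corpora = {
        "image_text": [{"src": "image_text", "id": i} for i in range(1000)],
        "interleaved": [{"src": "interleaved", "id": i} for i in range(100)],
        "video_text": [{"src": "video_text", "id": i} for i in range(50)],
    }
    ratio = {"image_text": 0.5, "interleaved": 0.3, "video_text": 0.2}
    first_batch = next(balanced_batch_stream(corpora, ratio))
    print([ex["src"] for ex in first_batch])
```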