1 Jan 2024 | Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
The paper introduces the COntrastive-Streamlined MultimOdal framework (CosMo), which integrates a contrastive loss into text generation models to enhance performance on tasks involving textual and visual data. CosMo is designed to handle both unimodal and multimodal inputs, reducing the number of learnable parameters while maintaining computational efficiency. The authors also introduce Howto-Interlink7M, a high-quality interleaved video-text dataset, to address the lack of such datasets in the field. This dataset, derived from the HowTo100M dataset using GPT-4, provides comprehensive captions and improves model performance on image-text tasks. CosMo is evaluated on 14 diverse downstream datasets, demonstrating superior performance over OpenFlamingo with fewer parameters and less data. The contributions of CosMo and Howto-Interlink7M are highlighted by significant performance gains across various image-text and video-text tasks.
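The core idea of pairing a contrastive objective with a text-generation loss can be sketched as a weighted sum of the two terms. The snippet below is a minimal NumPy illustration, not the paper's implementation: it assumes a symmetric InfoNCE-style contrastive loss over a batch of image and text embeddings (matched pairs on the diagonal of the similarity matrix), and the function names (`contrastive_loss`, `joint_loss`) and the weighting factor `alpha` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched image/text pairs share a row index."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # pairwise cosine similarities
    n = logits.shape[0]
    idx = np.arange(n)
    p_i2t = softmax(logits, axis=1)             # image-to-text matching
    p_t2i = softmax(logits, axis=0)             # text-to-image matching
    return -(np.log(p_i2t[idx, idx]).mean() + np.log(p_t2i[idx, idx]).mean()) / 2

def joint_loss(lm_loss, img_emb, txt_emb, alpha=0.5):
    """Hypothetical training objective: generation loss plus weighted contrastive term."""
    return lm_loss + alpha * contrastive_loss(img_emb, txt_emb)
```

With aligned embeddings the diagonal of the similarity matrix dominates and the contrastive term is near zero; with unrelated embeddings it approaches the log of the batch size, so minimizing the joint objective pulls matched image-text pairs together while the language-modeling term is optimized as usual.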