MM-Interleaved is an end-to-end generative model for interleaved image-text data, designed to capture fine-grained image details and to generate consistent images and text. Its core component, the Multi-Modal Feature Synchronizer (MMFS), gives the generation process direct access to high-resolution image features, overcoming the information loss imposed by a fixed number of visual tokens.

The architecture couples a Visual Foundation Model, a Large Language Model, and a Diffusion Model. The MMFS module extracts image features dynamically during decoding, which is particularly useful in multi-image and high-resolution settings. The model is pre-trained on both paired and interleaved image-text data and then refined with supervised fine-tuning to improve multi-modal instruction following.

Evaluated on tasks including visual question answering, image captioning, and text-to-image generation, MM-Interleaved achieves state-of-the-art results on a range of benchmarks without relying on in-house data, while remaining efficient. It produces accurate text descriptions and visually consistent images from interleaved inputs, and it is especially strong on tasks that depend on fine visual details, making it a significant step forward in multi-modal generative modeling.
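To make the role of the feature synchronizer concrete, the sketch below shows one way such a block could look in PyTorch: decoder token states attend directly to a high-resolution image feature map instead of relying only on a fixed set of visual tokens. The class name, dimensions, and the use of standard multi-head cross-attention are illustrative assumptions for this summary, not the authors' implementation, which uses a more efficient sparse attention mechanism.

import torch
import torch.nn as nn


class FeatureSynchronizer(nn.Module):
    """Hypothetical block: lets decoder states attend to high-resolution image features."""

    def __init__(self, token_dim: int, image_dim: int, num_heads: int = 8):
        super().__init__()
        # Project vision-encoder features into the decoder's hidden dimension.
        self.img_proj = nn.Linear(image_dim, token_dim)
        # Plain cross-attention stands in for the paper's sparse attention (simplification).
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, tokens: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # tokens:      (B, T, token_dim)  decoder hidden states (text or image queries)
        # image_feats: (B, N, image_dim)  flattened high-resolution feature map
        ctx = self.img_proj(image_feats)
        attended, _ = self.cross_attn(query=tokens, key=ctx, value=ctx)
        # Residual update keeps the original decoder stream intact.
        return self.norm(tokens + attended)


if __name__ == "__main__":
    sync = FeatureSynchronizer(token_dim=1024, image_dim=1280)
    tokens = torch.randn(2, 77, 1024)            # e.g. LLM hidden states for one segment
    image_feats = torch.randn(2, 32 * 32, 1280)  # e.g. a 32x32 grid of ViT features
    out = sync(tokens, image_feats)
    print(out.shape)  # torch.Size([2, 77, 1024])

In this toy setup, the same block could be applied before both the text decoder and the diffusion decoder, so that either output modality can pull in fine-grained visual context on demand rather than through a small, fixed token budget.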