MM-Interleaved: Interleaved Image-Text Generation via Multi-modal Feature Synchronizer

2 Apr 2024 | Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, Hongsheng Li, Yu Qiao, Jifeng Dai
MM-Interleaved is an end-to-end generative model for interleaved image-text data, in which multiple images and text segments appear together in a single sequence. The model addresses the challenge of efficiently capturing fine-grained image details by introducing a multi-scale and multi-image feature synchronizer (MMFS). MMFS lets the model access detailed image features from the preceding context during generation, which improves its ability to follow complex multi-modal instructions. MM-Interleaved is pre-trained on both paired and interleaved image-text corpora and then fine-tuned to improve its performance on tasks including visual question answering, image captioning, and text-to-image generation.
Experiments demonstrate that MM-Interleaved outperforms existing methods in recognizing visual details and generating consistent images, achieving state-of-the-art results on multiple benchmarks without using in-house data. The key contributions of the paper are the MMFS module for efficient extraction of detailed image features and the MM-Interleaved model for effective interleaved image-text processing.
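To make the idea behind MMFS more concrete, the following is a minimal sketch of how a token might read multi-scale features from the images that precede it in the sequence. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say which attention mechanism MMFS uses, so plain dense cross-attention is swapped in here, and the function name `synchronize`, its arguments, and the residual update are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; rows may contain -inf for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def synchronize(queries, query_pos, image_feats, image_pos):
    """Hypothetical stand-in for a multi-scale, multi-image feature lookup.

    queries:     (T, D) token hidden states
    query_pos:   length-T positions of the tokens in the interleaved sequence
    image_feats: one entry per image; each entry is a list of (N_s, D) arrays,
                 one array per feature scale
    image_pos:   position of each image in the interleaved sequence
    Returns (T, D): each token plus the image details it attended to.
    """
    # Flatten every scale of every image into one key/value bank and
    # remember which image each feature vector came from.
    bank, owner = [], []
    for img_idx, scales in enumerate(image_feats):
        for feat in scales:
            bank.append(feat)
            owner.extend([img_idx] * feat.shape[0])
    bank = np.concatenate(bank, axis=0)            # (K, D)
    owner = np.asarray(owner)                      # (K,)

    # A token may only see images that appear earlier in the sequence.
    key_pos = np.asarray(image_pos)[owner]         # (K,)
    visible = key_pos[None, :] < np.asarray(query_pos)[:, None]  # (T, K)

    scores = queries @ bank.T / np.sqrt(queries.shape[1])
    scores = np.where(visible, scores, -np.inf)

    # Tokens with no preceding image get a zero update instead of NaN.
    has_ctx = visible.any(axis=1)
    scores[~has_ctx] = 0.0
    weights = softmax(scores, axis=1) * has_ctx[:, None]

    return queries + weights @ bank


# Usage: three tokens at positions 0, 3, 6 and two images at positions 2, 5,
# each image described at a coarse (4 vectors) and a fine (16 vectors) scale.
rng = np.random.default_rng(0)
D = 4
tokens = rng.normal(size=(3, D))
img0 = [rng.normal(size=(4, D)), rng.normal(size=(16, D))]
img1 = [rng.normal(size=(4, D)), rng.normal(size=(16, D))]
out = synchronize(tokens, [0, 3, 6], [img0, img1], [2, 5])
print(out.shape)  # (3, 4)
```

In this sketch the first token is returned unchanged because no image precedes it, the second token sees only the first image, and the third sees both. Keeping coarse and fine feature maps in the same bank is one simple way to let a single lookup draw on several scales and several images at once.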