MM-Interleaved is an end-to-end generative model for interleaved image-text data, designed to capture fine-grained image details and to generate consistent images and text. Its core component, the Multi-Modal Feature Synchronizer (MMFS), gives the generation process direct access to high-resolution image features, overcoming the information loss imposed by a fixed number of visual tokens.

The architecture couples a Visual Foundation Model, a Large Language Model, and a Diffusion Model. The MMFS module extracts image features dynamically during decoding, which is particularly useful in multi-image and high-resolution settings. The model is pre-trained on both paired and interleaved image-text data and then refined with supervised fine-tuning to improve multi-modal instruction following.

Evaluated on tasks including visual question answering, image captioning, and text-to-image generation, MM-Interleaved achieves state-of-the-art results on a range of benchmarks without relying on in-house data, while remaining efficient. It produces accurate text descriptions and visually consistent images from interleaved inputs, and it is especially strong on tasks that depend on fine visual details, making it a significant step forward in multi-modal generative modeling.
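To make the role of the feature synchronizer concrete, the sketch below shows one way such a block could look in PyTorch: decoder token states attend directly to a high-resolution image feature map instead of relying only on a fixed set of visual tokens. The class name, dimensions, and the use of standard multi-head cross-attention are illustrative assumptions for this summary, not the authors' implementation, which uses a more efficient sparse attention mechanism.

import torch
import torch.nn as nn


class FeatureSynchronizer(nn.Module):
    """Hypothetical block: lets decoder states attend to high-resolution image features."""

    def __init__(self, token_dim: int, image_dim: int, num_heads: int = 8):
        super().__init__()
        # Project vision-encoder features into the decoder's hidden dimension.
        self.img_proj = nn.Linear(image_dim, token_dim)
        # Plain cross-attention stands in for the paper's sparse attention (simplification).
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, tokens: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # tokens:      (B, T, token_dim)  decoder hidden states (text or image queries)
        # image_feats: (B, N, image_dim)  flattened high-resolution feature map
        ctx = self.img_proj(image_feats)
        attended, _ = self.cross_attn(query=tokens, key=ctx, value=ctx)
        # Residual update keeps the original decoder stream intact.
        return self.norm(tokens + attended)


if __name__ == "__main__":
    sync = FeatureSynchronizer(token_dim=1024, image_dim=1280)
    tokens = torch.randn(2, 77, 1024)            # e.g. LLM hidden states for one segment
    image_feats = torch.randn(2, 32 * 32, 1280)  # e.g. a 32x32 grid of ViT features
    out = sync(tokens, image_feats)
    print(out.shape)  # torch.Size([2, 77, 1024])

In this toy setup, the same block could be applied before both the text decoder and the diffusion decoder, so that either output modality can pull in fine-grained visual context on demand rather than through a small, fixed token budget.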