This paper presents TextHarmony, a unified multimodal generative model that handles both visual text comprehension and visual text generation. Generating images and text within a single model typically degrades performance because of the inherent inconsistency between the vision and language modalities. To address this, the authors propose Slide-LoRA, a dynamic mechanism that combines modality-specific and modality-agnostic LoRA experts to partially decouple the multimodal generation space. With Slide-LoRA, TextHarmony generates both text and images in a single model instance while matching the performance of modality-specific fine-tuning with only a 2% increase in parameters.

The architecture integrates a vision encoder, an LLM, and an image decoder, with Slide-LoRA allowing text generation and image generation to coexist in one set of weights. The authors also introduce DetailedTextCaps-100K, a high-quality image caption dataset synthesized with a closed-source MLLM to strengthen visual text generation.

Comprehensive experiments across various benchmarks demonstrate the effectiveness of the approach, with an average improvement of 2.5% on visual text comprehension tasks and 4.0% on visual text generation tasks. TextHarmony performs well across visual text-centric tasks, including text detection, recognition, and generation, and achieves state-of-the-art performance on text grounding.

The paper also discusses limitations, noting that TextHarmony still trails state-of-the-art models on some visual text perception and comprehension tasks and that further research is needed to refine the capabilities of multimodal generative models.
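To make the Slide-LoRA mechanism described above more concrete, the following is a minimal PyTorch sketch of a gated mixture of LoRA experts wrapped around one frozen linear layer. The class names (LoRAExpert, SlideLoRALayer), the sigmoid gate over pooled hidden states, and the choice of one text-specific, one image-specific, and one shared expert are illustrative assumptions based on the summary above, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    """One low-rank adapter branch: scale * up(down(x))."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale


class SlideLoRALayer(nn.Module):
    """Hypothetical Slide-LoRA-style wrapper around a frozen linear layer.

    A small gate decides, per input, how much weight to give a text-specific
    expert versus an image-specific expert; a modality-agnostic expert is
    always added on top of the frozen base projection.
    """

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep pretrained weights frozen
        in_dim, out_dim = base.in_features, base.out_features
        self.text_expert = LoRAExpert(in_dim, out_dim, rank)    # modality-specific
        self.image_expert = LoRAExpert(in_dim, out_dim, rank)   # modality-specific
        self.shared_expert = LoRAExpert(in_dim, out_dim, rank)  # modality-agnostic
        self.gate = nn.Linear(in_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim). Pool over the sequence to get one
        # gating score per example, then slide between the two experts.
        g = torch.sigmoid(self.gate(x.mean(dim=1, keepdim=True)))  # (batch, 1, 1)
        specific = g * self.text_expert(x) + (1.0 - g) * self.image_expert(x)
        return self.base(x) + specific + self.shared_expert(x)


if __name__ == "__main__":
    layer = SlideLoRALayer(nn.Linear(768, 768), rank=8)
    hidden = torch.randn(2, 16, 768)
    print(layer(hidden).shape)  # torch.Size([2, 16, 768])
```

In this sketch the gate slides continuously between the text and image experts on a per-example basis while the shared expert is always active; the paper's actual routing rule and expert layout may differ.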