Harmonizing Visual Text Comprehension and Generation

23 Jul 2024 | Zhen Zhao1,2,*, Jingqun Tang2,3,5, Binghong Wu2, Chunhui Lin2, Shu Wei2, Hao Liu2, Xin Tan1, Zhizhong Zhang1, Can Huang2, Yuan Xie1,5
This paper introduces TextHarmony, a versatile multimodal generative model designed to harmonize visual text comprehension and generation. The model addresses the inherent inconsistency between the vision and language modalities, which often degrades performance in multimodal generation. To this end, the authors propose Slide-LoRA, a novel module that dynamically aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space. This design lets TextHarmony generate both images and text within a single model instance.

The paper also presents DetailedTextCaps-100K, a high-quality image caption dataset created with an advanced closed-source MLLM. This dataset markedly improves TextHarmony's image generation quality, making it more effective on visual text generation tasks.

Experiments across various benchmarks demonstrate the effectiveness of TextHarmony. The model achieves performance comparable to modality-specific fine-tuning with only a 2% increase in parameters, and shows an average improvement of 2.5% on visual text comprehension tasks and 4.0% on visual text generation tasks. The paper concludes by highlighting the potential of integrated multimodal generation models in the visual text domain and laying a foundation for future research.
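The summary does not spell out exactly how Slide-LoRA routes inputs between experts, so the sketch below is only a minimal illustration of the general idea: a frozen backbone layer augmented with an image-specific, a text-specific, and a shared (modality-agnostic) low-rank expert, mixed by a learned gate. All names (LoRAExpert, SlideLoRASketch, the scalar gate, the choice of three experts) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """A single low-rank adapter producing a delta of (alpha / r) * B @ A @ x."""
    def __init__(self, dim_in, dim_out, rank=8, alpha=16.0):
        super().__init__()
        self.down = nn.Linear(dim_in, rank, bias=False)   # A: project down to rank
        self.up = nn.Linear(rank, dim_out, bias=False)    # B: project back up
        self.scale = alpha / rank
        nn.init.zeros_(self.up.weight)                    # start with a zero delta

    def forward(self, x):
        return self.up(self.down(x)) * self.scale

class SlideLoRASketch(nn.Module):
    """Hypothetical gating over modality-specific and modality-agnostic
    LoRA experts on top of a frozen linear layer (illustrative only)."""
    def __init__(self, base_linear, rank=8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False                       # keep the backbone frozen
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.img_expert = LoRAExpert(d_in, d_out, rank)   # image-generation expert
        self.txt_expert = LoRAExpert(d_in, d_out, rank)   # text-generation expert
        self.shared_expert = LoRAExpert(d_in, d_out, rank)  # modality-agnostic expert
        self.gate = nn.Linear(d_in, 1)                    # scalar modality gate

    def forward(self, x):
        # g -> 1 favours the image expert, g -> 0 the text expert;
        # the shared expert is always added.
        g = torch.sigmoid(self.gate(x))
        specific = g * self.img_expert(x) + (1 - g) * self.txt_expert(x)
        return self.base(x) + specific + self.shared_expert(x)

# Usage: wrap a frozen projection layer and run a dummy token batch.
layer = SlideLoRASketch(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 16, 1024))   # (batch, tokens, dim)
print(out.shape)                        # torch.Size([2, 16, 1024])
```

Because only the small expert and gate weights are trainable, a setup like this keeps the added parameter count low, which is consistent with the roughly 2% parameter overhead reported for TextHarmony.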