mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

29 Mar 2024 | Qinghao Ye*, Haiyang Xu*, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
mPLUG-Owl is a novel training paradigm that equips large language models (LLMs) with multi-modal abilities through modularized learning. The model comprises a foundation LLM, a visual knowledge module, and a visual abstractor module, supporting multiple modalities and enabling diverse unimodal and multimodal abilities through modality collaboration. Training follows a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining and even improving the LLM's generation abilities. Experimental results show that mPLUG-Owl outperforms existing multi-modal models, demonstrating strong instruction-following, visual understanding, multi-turn conversation, and knowledge reasoning abilities. The model also exhibits unexpected abilities such as multi-image correlation and scene text understanding, making it suitable for real-world scenarios like vision-only document comprehension. The code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
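
To make the modular design concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. It assumes a CLIP-like ViT as the visual knowledge module, a cross-attention visual abstractor that compresses patch features into a fixed set of learnable query tokens, and a HuggingFace-style decoder-only LLM that accepts inputs_embeds. The class and function names (VisualAbstractor, MPlugOwlSketch, set_stage) are illustrative, and the stage-wise freezing schedule is an assumption that mirrors the two-stage alignment described above.

# Illustrative sketch only; module names and dimensions are assumptions.
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Compresses patch features into a fixed number of visual query tokens."""
    def __init__(self, num_queries=64, vis_dim=1024, llm_dim=4096, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_feats):              # patch_feats: (B, N_patches, vis_dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.proj(out)                    # (B, num_queries, llm_dim)

class MPlugOwlSketch(nn.Module):
    """Modular composition: visual knowledge module + abstractor + foundation LLM."""
    def __init__(self, vision_encoder, abstractor, llm):
        super().__init__()
        self.vision_encoder = vision_encoder     # e.g. a ViT returning patch features
        self.abstractor = abstractor
        self.llm = llm                           # decoder-only LLM taking inputs_embeds

    def forward(self, images, input_ids):
        vis_tokens = self.abstractor(self.vision_encoder(images))
        txt_embeds = self.llm.get_input_embeddings()(input_ids)
        # Prepend abstracted visual tokens to the text embeddings and decode as usual.
        inputs_embeds = torch.cat([vis_tokens, txt_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)

def set_stage(model, stage):
    # Assumed schedule for the two-stage method:
    # Stage 1 (image-text alignment): train the visual modules, keep the LLM frozen.
    # Stage 2 (joint instruction tuning): freeze the visual side and adapt the LLM
    # lightly (e.g. with LoRA adapters) on unimodal + multimodal instruction data.
    for p in model.vision_encoder.parameters():
        p.requires_grad = (stage == 1)
    for p in model.abstractor.parameters():
        p.requires_grad = (stage == 1)
    for p in model.llm.parameters():
        p.requires_grad = False                  # only lightweight adapters train in stage 2

One design point this sketch highlights: prepending a small number of abstracted visual tokens, rather than all patch features, keeps the LLM's input sequence short while letting the mostly frozen LLM treat images as ordinary prefix context.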