29 Mar 2024 | Qinghao Ye*, Haiyang Xu*, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
mPLUG-Owl is a novel training paradigm designed to equip large language models (LLMs) with multi-modal capabilities. It introduces a modularized learning approach built from a foundation LLM, a visual knowledge module, and a visual abstractor module. This design supports multiple modalities and facilitates diverse unimodal and multimodal abilities through modality collaboration. Training proceeds in two stages: first, image and text are aligned by training the visual knowledge and visual abstractor modules while the LLM is kept frozen; second, the model is jointly fine-tuned on language-only and multi-modal instruction data. This approach preserves, and in some cases improves, the LLM's generation abilities while adding visual understanding. Experimental results show that mPLUG-Owl outperforms existing multi-modal models in instruction understanding, visual understanding, knowledge transfer, and multi-turn dialogue. The model also exhibits emergent abilities such as multi-image correlation and scene text understanding, making it suitable for real-world applications like vision-only document comprehension. The code, pre-trained model, instruction-tuned models, and evaluation set are available online.
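To make the two-stage, modularized training concrete, below is a minimal PyTorch sketch of the idea described above: a visual encoder, a cross-attention visual abstractor, and a language model, with a stage switch that freezes the LLM during vision-language alignment and freezes the vision modules during instruction tuning. All class and function names here are illustrative stand-ins, not the actual mPLUG-Owl implementation (which, for example, uses LoRA rather than full LLM fine-tuning in stage two).

```python
import torch
import torch.nn as nn


class VisualAbstractor(nn.Module):
    """Compresses a variable-length sequence of visual features into a
    fixed number of learnable query tokens via cross-attention."""

    def __init__(self, dim=512, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats):               # (B, N, dim)
        q = self.queries.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, visual_feats, visual_feats)
        return out                                  # (B, num_queries, dim)


class ToyMultimodalLM(nn.Module):
    """Toy stand-in for the modularized design: visual encoder + visual
    abstractor + language model consuming a mixed token sequence."""

    def __init__(self, dim=512, vocab=1000):
        super().__init__()
        self.visual_encoder = nn.Sequential(nn.Linear(768, dim), nn.GELU())
        self.abstractor = VisualAbstractor(dim)
        self.embed = nn.Embedding(vocab, dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, image_feats, text_ids):
        v = self.abstractor(self.visual_encoder(image_feats))  # visual tokens
        t = self.embed(text_ids)                                # text tokens
        h = self.llm(torch.cat([v, t], dim=1))                  # joint sequence
        return self.head(h)


def set_stage(model, stage):
    """Stage 1: train the visual modules, keep the LLM frozen.
    Stage 2: freeze the visual encoder and tune the language side
    (simplified here as full fine-tuning instead of LoRA)."""
    for p in model.parameters():
        p.requires_grad = True
    if stage == 1:
        for p in model.llm.parameters():
            p.requires_grad = False
        for p in model.embed.parameters():
            p.requires_grad = False
    elif stage == 2:
        for p in model.visual_encoder.parameters():
            p.requires_grad = False


if __name__ == "__main__":
    model = ToyMultimodalLM()
    set_stage(model, 1)   # pre-training: align vision with the frozen LLM
    set_stage(model, 2)   # instruction tuning: freeze vision, adapt the LLM
    logits = model(torch.randn(2, 49, 768), torch.randint(0, 1000, (2, 16)))
    print(logits.shape)   # (2, 64 visual + 16 text tokens, vocab)
```

The stage switch captures the key design choice from the abstract: the LLM is untouched while the visual modules learn to align with it, and the vision backbone is then held fixed while the language side is adapted to instructions.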