30 Jan 2024 | Xiaoran Fan*, Tao Ji*, Changhao Jiang*, Shuo Li*, Senjie Jin*, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui†, Qi Zhang†, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
MouSi is a poly-visual-expert vision-language model that integrates multiple visual encoders to enhance performance. The model addresses challenges such as insufficient visual capabilities and excessive visual token lengths in existing large vision-language models (VLMs). By combining experts skilled in tasks like image-text matching, OCR, and image segmentation, MouSi improves the model's ability to interpret complex visual information. A fusion network unifies the outputs of the different visual experts, while positional encoding schemes reduce the waste of positional embeddings caused by long image sequences. Experimental results show that VLMs with multiple experts outperform those with isolated visual encoders, with performance improving as more experts are integrated. The model is open-sourced, and its training code is available on the project website.

MouSi's architecture consists of a multi-expert visual encoder, a poly-expert fusion network, and a pre-trained open-source LLM. The fusion network uses methods such as MLP projection and Q-Former to compress visual information and reduce computational costs, and different positional encoding schemes are explored to optimize the model's performance. Experiments on nine benchmarks demonstrate that MouSi significantly outperforms existing models, achieving the best performance on eight of the nine. The results highlight the model's ability to handle complex multimodal tasks, showing strong capabilities in vision-language understanding and generation.
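To make the fusion step concrete, here is a minimal PyTorch sketch of what an MLP-projection fusion over several visual experts could look like. The class name, dimensions, and the assumption that the experts' patch grids are spatially aligned are illustrative only and not taken from the MouSi codebase; the paper's actual fusion network (including its Q-Former variant) may differ in detail.

```python
import torch
import torch.nn as nn

class PolyExpertMLPFusion(nn.Module):
    """Hypothetical sketch of MLP-projection fusion over multiple visual experts.

    Each expert (e.g. an image-text matching encoder, an OCR encoder, a
    segmentation encoder) emits a patch-token sequence; features of aligned
    patches are concatenated channel-wise and projected into the LLM
    embedding space.
    """

    def __init__(self, expert_dims, llm_dim, hidden_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sum(expert_dims), hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, expert_features):
        # expert_features: list of [batch, num_patches, dim_i] tensors,
        # assumed spatially aligned (same num_patches for every expert).
        fused = torch.cat(expert_features, dim=-1)  # [batch, num_patches, sum(dims)]
        return self.proj(fused)                     # [batch, num_patches, llm_dim]


# Toy usage: three experts with different feature widths, projected to a 4096-d LLM.
expert_outputs = [torch.randn(1, 256, d) for d in (1024, 768, 1536)]
fusion = PolyExpertMLPFusion(expert_dims=[1024, 768, 1536], llm_dim=4096)
visual_tokens = fusion(expert_outputs)  # [1, 256, 4096], passed to the LLM as visual tokens
```

Concatenating along the channel dimension keeps the visual token count fixed regardless of how many experts are added, which is one plausible way to avoid the excessive token lengths the summary mentions; a Q-Former-style fusion would instead compress the sequence length with a fixed set of learned queries.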