30 Jan 2024 | Xiaoran Fan*, Tao Ji*, Changhao Jiang*, Shuo Li*, Senjie Jin*, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui†, Qi Zhang†, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
This paper introduces MouSi, a novel poly-visual-expert vision-language model (VLM) designed to enhance the performance and applicability of VLMs by leveraging the strengths of multiple visual encoders. The authors address challenges such as the limited capability of any single visual component and excessive visual token lengths, which can restrict the model's ability to interpret complex visual information and handle lengthy contextual data. MouSi employs an ensemble of visual encoders, including experts in image-text matching, OCR, and image segmentation, and introduces a fusion network to unify the processing of outputs from the different visual experts. Additionally, the paper explores positional encoding schemes that reduce the positional-embedding waste caused by lengthy image feature sequences, mitigating position overflow and length limitations.
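To make the ensemble idea concrete, the sketch below shows one plausible way to combine per-patch features from several frozen vision experts with a simple MLP fusion network. The encoder names, feature dimensions, and the `PolyExpertFusion` interface are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a poly-visual-expert front end (assumed shapes, not MouSi's exact code).
# Each expert yields per-patch features; a small MLP maps the concatenated
# features into the LLM's embedding space.
import torch
import torch.nn as nn


class PolyExpertFusion(nn.Module):
    def __init__(self, expert_dims, llm_dim=4096):
        super().__init__()
        # One frozen expert per entry, e.g. {"clip": 1024, "dinov2": 1536, "layoutlmv3": 768}.
        self.proj = nn.Sequential(
            nn.Linear(sum(expert_dims.values()), llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, expert_features):
        # expert_features: dict name -> tensor of shape (batch, num_patches, dim).
        # Assumes all experts have been aligned to the same patch grid.
        fused = torch.cat([expert_features[k] for k in sorted(expert_features)], dim=-1)
        return self.proj(fused)  # (batch, num_patches, llm_dim) visual tokens for the LLM


# Toy usage with random tensors standing in for real encoder outputs.
feats = {
    "clip": torch.randn(1, 576, 1024),
    "dinov2": torch.randn(1, 576, 1536),
    "layoutlmv3": torch.randn(1, 576, 768),
}
fusion = PolyExpertFusion({"clip": 1024, "dinov2": 1536, "layoutlmv3": 768})
visual_tokens = fusion(feats)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```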
The experimental results demonstrate that VLMs with multiple visual experts consistently outperform counterparts built on a single, isolated visual encoder, and that the number of experts has a significant impact on performance. The authors also conduct ablation studies on different fusion methods and positional encoding schemes, providing insights into the optimal configuration for VLMs.
MouSi's architecture consists of a multi-expert visual encoder, a poly-expert fusion network, and a pre-trained open-source LLM. The multi-expert visual encoder combines six well-known visual encoders, and the poly-expert fusion network is implemented as either an MLP projection network or a Q-Former network. The paper also explores different positional encoding schemes to improve the assignment of position embeddings, reducing the computational cost and memory usage of VLMs.
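As a rough illustration of the position-saving idea, the snippet below builds position ids in which all visual tokens share a single position while text tokens advance normally. This is one plausible scheme consistent with the paper's description of reducing positional-embedding waste; the exact variants MouSi evaluates may differ.

```python
# Hypothetical position-id layout: every visual token reuses one position id,
# so a long (possibly multi-expert) image sequence consumes a single slot of
# the LLM's positional budget instead of hundreds or thousands.
import torch


def build_position_ids(num_prefix_text, num_visual, num_suffix_text):
    prefix = torch.arange(num_prefix_text)
    # All visual tokens share the next available position id.
    visual = torch.full((num_visual,), num_prefix_text)
    suffix = torch.arange(num_prefix_text + 1,
                          num_prefix_text + 1 + num_suffix_text)
    return torch.cat([prefix, visual, suffix])


pos = build_position_ids(num_prefix_text=5, num_visual=1728, num_suffix_text=4)
print(pos[:8], pos[-6:])   # text ids count up, the visual block stays flat
print(int(pos.max()) + 1)  # positional budget used: 10 instead of 1737
```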
The main results show that MouSi outperforms existing VLMs across a broad range of benchmarks, with the best performance achieved by a triple-expert combination of LayoutLMv3, DINOv2, and CLIP. The authors further boost performance through data augmentation, demonstrating that additional training data significantly improves the capabilities of VLMs.
Overall, the paper contributes to the field by proposing a novel poly-visual-expert VLM that leverages the strengths of multiple visual encoders, addressing key challenges in VLMs and achieving superior performance in multimodal tasks.