MoAI: Mixture of All Intelligence for Large Language and Vision Models

17 Jul 2024 | Byung-Kwan Lee, Beomchan Park, Chae Won Kim, and Yong Man Ro
MoAI is a new large language and vision model (LLVM) that integrates auxiliary visual information from external computer vision models to enhance real-world scene understanding. It leverages outputs from segmentation, detection, scene graph generation, and optical character recognition (OCR) models, which are verbalized into text and condensed by the MoAI-Compressor. The MoAI-Mixer then combines visual, auxiliary, and language features using the concept of Mixture of Experts.

This approach allows MoAI to outperform both open-source and closed-source LLVMs on zero-shot vision-language tasks, particularly those involving object existence, positions, relationships, and OCR, without requiring additional visual instruction tuning datasets or model scaling. MoAI's architecture consists of a vision encoder, a backbone multimodal language model equipped with MoAI-Mixers, and intermediate MLP connectors. The mixers use a combination of cross- and self-attention modules to blend the different types of intelligence, yielding significant improvements in visual perception and zero-shot performance. Across a range of vision-language benchmarks, MoAI significantly outperforms existing models, demonstrating its effectiveness in real-world scene understanding.
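To make the mixing idea concrete, below is a minimal PyTorch sketch of a MoAI-Mixer-style block: a small set of attention "experts" (language attending to visual features, language attending to condensed auxiliary features, and language-only self-attention) whose outputs are blended by a learned gate, in the spirit of Mixture of Experts. All module names, the number of experts, dimensions, and the gating scheme here are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a MoAI-Mixer-style block (simplified; the real model
# defines its own experts, gating, and insertion points in the backbone LLM).
import torch
import torch.nn as nn


class CrossAttentionExpert(nn.Module):
    """Lets language hidden states attend to an external context (visual or auxiliary)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lang: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query=lang, key=context, value=context)
        return out


class SelfAttentionExpert(nn.Module):
    """Refines language hidden states without external context."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lang: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(lang, lang, lang)
        return out


class MoAIMixerSketch(nn.Module):
    """Gated mixture over visual, auxiliary, and language experts (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.visual_expert = CrossAttentionExpert(dim)
        self.aux_expert = CrossAttentionExpert(dim)
        self.lang_expert = SelfAttentionExpert(dim)
        self.gate = nn.Linear(dim, 3)  # one logit per expert, per token

    def forward(self, lang, visual, aux):
        expert_outs = torch.stack(
            [
                self.visual_expert(lang, visual),  # language x vision-encoder features
                self.aux_expert(lang, aux),        # language x verbalized CV outputs
                self.lang_expert(lang),            # language-only refinement
            ],
            dim=-2,
        )  # shape: (batch, seq, 3, dim)
        weights = torch.softmax(self.gate(lang), dim=-1).unsqueeze(-1)
        return lang + (weights * expert_outs).sum(dim=-2)  # residual blend


# Toy usage with random tensors
mixer = MoAIMixerSketch(dim=64)
lang = torch.randn(2, 16, 64)    # language token features
visual = torch.randn(2, 32, 64)  # vision-encoder features
aux = torch.randn(2, 24, 64)     # condensed auxiliary features (MoAI-Compressor output)
print(mixer(lang, visual, aux).shape)  # torch.Size([2, 16, 64])
```

The gate assigns per-token weights to each expert, so tokens that need grounding in detected objects or OCR text can lean on the auxiliary expert, while others rely on the visual or language-only paths.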