17 Jul 2024 | Byung-Kwan Lee, Beomchan Park, Chae Won Kim, and Yong Man Ro
The paper introduces MoAI, a new large language and vision model (LLVM) that leverages auxiliary visual information from external computer vision (CV) models to enhance real-world scene understanding. MoAI consists of two main modules: *MoAI-Compressor* and *MoAI-Mixer*. *MoAI-Compressor* processes the outputs of external CV models, such as panoptic segmentation, open-world object detection, scene graph generation, and optical character recognition (OCR), and aligns and condenses them into relevant visual information. *MoAI-Mixer* integrates this visual information with visual and language features from the LLVM backbone using the concept of Mixture of Experts (MoE). This integration allows MoAI to effectively handle various visual perception tasks, including object existence, positions, relations, and OCR, without requiring additional dataset curation or model scaling. Experimental results show that MoAI outperforms both open-source and closed-source LLVMs in zero-shot vision language (VL) tasks, demonstrating its effectiveness in real-world scene understanding.
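To make the MoE-style mixing concrete, below is a minimal, hypothetical PyTorch sketch of a mixer that blends feature streams with learned per-token gating. It is not the authors' implementation: the class name `MoAIMixerSketch`, the use of simple MLP experts, and parameters such as `dim` and `num_experts` are all assumptions for illustration (the paper's experts could be, e.g., cross-attention modules over the compressed auxiliary features).

```python
import torch
import torch.nn as nn

class MoAIMixerSketch(nn.Module):
    """Hypothetical MoE-style mixer (illustrative only, not the paper's code).
    Each expert stands in for one feature pathway (e.g., visual, auxiliary
    CV-model features, language); a gate mixes them per token."""

    def __init__(self, dim: int, num_experts: int = 3):
        super().__init__()
        # One small MLP expert per pathway; real experts may be attention blocks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        # Gating network: per-token mixture weights over the experts.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, dim) tokens from the LLVM backbone
        weights = torch.softmax(self.gate(hidden), dim=-1)            # (B, S, E)
        expert_out = torch.stack([e(hidden) for e in self.experts], dim=-1)  # (B, S, D, E)
        mixed = (expert_out * weights.unsqueeze(2)).sum(dim=-1)       # (B, S, D)
        return hidden + mixed  # residual connection around the mixture

# Usage: mix backbone hidden states of (assumed) width 1024
mixer = MoAIMixerSketch(dim=1024)
tokens = torch.randn(2, 16, 1024)
print(mixer(tokens).shape)  # torch.Size([2, 16, 1024])
```

The softmax gate lets the model weight each pathway per token, which matches the paper's motivation: tokens needing OCR or relation cues can lean on the auxiliary experts while others rely on the backbone's own visual and language features.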