Ovis: Structural Embedding Alignment for Multimodal Large Language Model

17 Jun 2024 | Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Han-Jia Ye
Ovis is a novel multimodal large language model (MLLM) architecture designed to structurally align visual and textual embeddings. Unlike existing MLLMs, which rely on a connector (such as an MLP projector) to bridge visual and textual information, Ovis introduces an additional learnable visual embedding table so that both modalities share the same embedding strategy. Mirroring how textual embeddings are looked up from a vocabulary, each image patch is mapped to a probabilistic token, i.e., a probability distribution over a visual vocabulary, and the final visual embedding is the probability-weighted combination of the corresponding rows of the visual embedding table. This structural alignment enables more effective fusion of visual and textual information. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Ovis also performs strongly on specialized multimodal tasks, including mathematical reasoning, real-world visual tasks, and hallucination benchmarks.
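To make the lookup mechanism concrete, here is a minimal PyTorch sketch of a probabilistic visual embedding table. The class and variable names, vocabulary size, and hidden dimension are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Sketch of an Ovis-style probabilistic visual embedding lookup.

    `vocab_size` (visual vocabulary) and `hidden_dim` (LLM embedding
    width) are illustrative values, not the paper's settings.
    """
    def __init__(self, vocab_size: int = 8192, hidden_dim: int = 4096):
        super().__init__()
        # Learnable visual embedding table, analogous to the LLM's
        # textual embedding table.
        self.table = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, patch_logits: torch.Tensor) -> torch.Tensor:
        # patch_logits: (num_patches, vocab_size) scores produced by the
        # visual encoder's head, one row per image patch.
        probs = patch_logits.softmax(dim=-1)  # probabilistic tokens
        # Each visual embedding is the probability-weighted sum of the
        # table's rows -- a "soft" version of a textual token lookup.
        return probs @ self.table.weight      # (num_patches, hidden_dim)

# Usage: 256 patches from a hypothetical ViT head feed the table,
# yielding embeddings ready to interleave with textual embeddings.
head = VisualEmbeddingTable()
visual_embeds = head(torch.randn(256, 8192))
```

Note that a hard (argmax) lookup would discard most of the patch's visual information; the probabilistic combination keeps the representation continuous while preserving the table-lookup structure used for text.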
The model's training strategy involves three stages: first, training the visual embedding table; next, further training the visual embedding table together with the visual encoder's parameters; and finally, training the entire model on multimodal instruction datasets. Ovis achieves superior performance across multiple benchmarks, including MMStar and MMBench, as well as specialized benchmarks such as MathVista and HallusionBench. Its effectiveness is further validated through ablation studies, in which Ovis consistently outperforms connector-based architectures. Despite these strengths, Ovis has limitations: its training data consists of single-image samples, and it does not employ high-resolution-boosting techniques, which may limit its performance on tasks involving multiple images or high-resolution inputs. Overall, Ovis represents a significant advancement in MLLM architecture, offering a more structured and effective approach to multimodal learning.
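A staged schedule like this is often implemented by toggling which parameter groups receive gradients. The sketch below illustrates one way to do that; the module names and the exact parameter partition per stage are assumptions for illustration, and the paper's schedule may differ in detail:

```python
import torch.nn as nn

def configure_stage(stage: int, visual_table: nn.Module,
                    visual_encoder: nn.Module, llm: nn.Module) -> None:
    """Enable gradients only for the modules trained in a given stage.

    Hypothetical partition: stage 1 trains the visual embedding table,
    stage 2 additionally trains the visual encoder, and stage 3 trains
    the entire model.
    """
    schedule = [
        (visual_table,   {1, 2, 3}),  # trained in every stage
        (visual_encoder, {2, 3}),     # joins training in stage 2
        (llm,            {3}),        # unfrozen only in the final stage
    ]
    for module, active_stages in schedule:
        for p in module.parameters():
            p.requires_grad = stage in active_stages
```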