Ovis: Structural Embedding Alignment for Multimodal Large Language Model

17 Jun 2024 | Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Han-Jia Ye
Ovis is a novel multimodal large language model (MLLM) architecture designed to structurally align visual and textual embeddings. Unlike existing MLLMs, which rely on a connector (such as an MLP projector) to bridge visual and textual information, Ovis introduces an additional learnable visual embedding table so that both modalities share the same embedding strategy. Mirroring how textual embeddings are looked up from a vocabulary, each image patch is mapped to a probabilistic token, i.e., a probability distribution over a visual vocabulary, and the final visual embedding is the probability-weighted combination of the corresponding rows of the visual embedding table. This structural alignment enables more effective fusion of visual and textual information. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Ovis also performs strongly on specialized multimodal tasks, including mathematical reasoning, real-world visual tasks, and hallucination benchmarks.
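To make the lookup mechanism concrete, here is a minimal PyTorch sketch of a probabilistic visual embedding table. The class and variable names, vocabulary size, and hidden dimension are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Sketch of an Ovis-style probabilistic visual embedding lookup.

    `vocab_size` (visual vocabulary) and `hidden_dim` (LLM embedding
    width) are illustrative values, not the paper's settings.
    """
    def __init__(self, vocab_size: int = 8192, hidden_dim: int = 4096):
        super().__init__()
        # Learnable visual embedding table, analogous to the LLM's
        # textual embedding table.
        self.table = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, patch_logits: torch.Tensor) -> torch.Tensor:
        # patch_logits: (num_patches, vocab_size) scores produced by the
        # visual encoder's head, one row per image patch.
        probs = patch_logits.softmax(dim=-1)  # probabilistic tokens
        # Each visual embedding is the probability-weighted sum of the
        # table's rows -- a "soft" version of a textual token lookup.
        return probs @ self.table.weight      # (num_patches, hidden_dim)

# Usage: 256 patches from a hypothetical ViT head feed the table,
# yielding embeddings ready to interleave with textual embeddings.
head = VisualEmbeddingTable()
visual_embeds = head(torch.randn(256, 8192))
```

Note that a hard (argmax) lookup would discard most of the patch's visual information; the probabilistic combination keeps the representation continuous while preserving the table-lookup structure used for text.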
The model's training strategy involves three stages: first, training the visual embedding table; next, further training the visual embedding table together with the visual encoder's parameters; and finally, training the entire model on multimodal instruction datasets. Ovis achieves superior performance across multiple benchmarks, including MMStar and MMBench, as well as specialized benchmarks such as MathVista and HallusionBench. Its effectiveness is further validated through ablation studies, in which Ovis consistently outperforms connector-based architectures. Despite these strengths, Ovis has limitations: its training data consists of single-image samples, and it does not employ high-resolution-boosting techniques, which may limit its performance on tasks involving multiple images or high-resolution inputs. Overall, Ovis represents a significant advancement in MLLM architecture, offering a more structured and effective approach to multimodal learning.
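A staged schedule like this is often implemented by toggling which parameter groups receive gradients. The sketch below illustrates one way to do that; the module names and the exact parameter partition per stage are assumptions for illustration, and the paper's schedule may differ in detail:

```python
import torch.nn as nn

def configure_stage(stage: int, visual_table: nn.Module,
                    visual_encoder: nn.Module, llm: nn.Module) -> None:
    """Enable gradients only for the modules trained in a given stage.

    Hypothetical partition: stage 1 trains the visual embedding table,
    stage 2 additionally trains the visual encoder, and stage 3 trains
    the entire model.
    """
    schedule = [
        (visual_table,   {1, 2, 3}),  # trained in every stage
        (visual_encoder, {2, 3}),     # joins training in stage 2
        (llm,            {3}),        # unfrozen only in the final stage
    ]
    for module, active_stages in schedule:
        for p in module.parameters():
            p.requires_grad = stage in active_stages
```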