The paper introduces E5-V, a new framework that adapts Multimodal Large Language Models (MLLMs) to produce universal multimodal embeddings. E5-V leverages prompts to bridge the modality gap between different types of inputs, demonstrating strong embedding performance even without fine-tuning. The key contributions include:
1. **Prompt-Based Representation**: E5-V uses prompts to explicitly instruct MLLMs to represent multimodal inputs as words, unifying embeddings from different modalities in the same embedding space (see the first sketch after this list).
2. **Single-Modality Training**: E5-V is trained exclusively on text pairs, significantly reducing training costs and eliminating the need for expensive multimodal training data (see the loss sketch after this list).
3. **Performance on Various Tasks**: Extensive experiments on text-image retrieval, composed image retrieval, sentence embeddings, and image-image retrieval demonstrate E5-V's effectiveness in representing multimodal information.
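As a minimal sketch of the prompt-based representation, assuming the publicly released E5-V checkpoint (`royokong/e5-v`, built on LLaVA-NeXT) and the Hugging Face `transformers` LLaVA-NeXT classes: a "Summary ... in one word:" prompt is applied to either modality and the last token's final hidden state is taken as the embedding. The checkpoint name and the exact chat template below are assumptions drawn from the paper's description, not a verified reproduction:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed checkpoint name; substitute the released E5-V weights if it differs.
MODEL = "royokong/e5-v"
processor = LlavaNextProcessor.from_pretrained(MODEL)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL, torch_dtype=torch.float16
).to("cuda").eval()

# Prompts instruct the MLLM to compress either modality into a single word,
# so text and image embeddings land in the same space.
TEMPLATE = ("<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n \n")
TEXT_PROMPT = TEMPLATE.format("{}\nSummary above sentence in one word: ")
IMAGE_PROMPT = TEMPLATE.format("<image>\nSummary above image in one word: ")

@torch.no_grad()
def embed_text(sentence: str) -> torch.Tensor:
    inputs = processor(text=TEXT_PROMPT.format(sentence),
                       return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True, return_dict=True)
    # Final hidden state of the last token is the embedding.
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)

@torch.no_grad()
def embed_image(image: Image.Image) -> torch.Tensor:
    inputs = processor(text=IMAGE_PROMPT, images=image,
                       return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True, return_dict=True)
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)

# Text-image retrieval then reduces to a dot product of unit vectors:
# score = embed_text("a cat on a sofa") @ embed_image(Image.open("cat.jpg")).T
```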
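The single-modality training objective can be illustrated with a standard in-batch contrastive (InfoNCE) loss over embeddings of paired sentences. This is a sketch under the assumption of NLI-style text pairs; the temperature value and the absence of hard negatives are illustrative choices, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over text-pair embeddings.

    anchor, positive: (batch, dim) embeddings of paired sentences,
    each produced with the same "one word" prompt shown above.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    # (batch, batch) similarity matrix; diagonal entries are the true pairs,
    # all other entries in a row serve as in-batch negatives.
    logits = anchor @ positive.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```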
The paper also discusses the limitations of previous approaches, such as CLIP, and highlights the advantages of E5-V in handling interleaved visual and language inputs. E5-V shows competitive performance on all tasks, often surpassing state-of-the-art models, and provides a more efficient and effective solution for universal multimodal embedding tasks.