The paper introduces E5-V, a new framework that adapts Multimodal Large Language Models (MLLMs) to produce universal multimodal embeddings. E5-V leverages prompts to bridge the modality gap between different types of inputs, demonstrating strong embedding performance even without fine-tuning. The key contributions include:
1. **Prompt-Based Representation**: E5-V uses prompts to explicitly instruct MLLMs to represent multimodal inputs as words, unifying embeddings from different modalities in the same embedding space (see the first sketch after this list).
2. **Single-Modality Training**: E5-V is trained exclusively on text pairs, significantly reducing training costs and eliminating the need for expensive multimodal training data (see the loss sketch after this list).
3. **Performance on Various Tasks**: Extensive experiments on text-image retrieval, composed image retrieval, sentence embeddings, and image-image retrieval demonstrate E5-V's effectiveness in representing multimodal information.
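As a minimal sketch of the prompt-based representation, assuming the publicly released E5-V checkpoint (`royokong/e5-v`, built on LLaVA-NeXT) and the Hugging Face `transformers` LLaVA-NeXT classes: a "Summary ... in one word:" prompt is applied to either modality and the last token's final hidden state is taken as the embedding. The checkpoint name and the exact chat template below are assumptions drawn from the paper's description, not a verified reproduction:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed checkpoint name; substitute the released E5-V weights if it differs.
MODEL = "royokong/e5-v"
processor = LlavaNextProcessor.from_pretrained(MODEL)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL, torch_dtype=torch.float16
).to("cuda").eval()

# Prompts instruct the MLLM to compress either modality into a single word,
# so text and image embeddings land in the same space.
TEMPLATE = ("<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n \n")
TEXT_PROMPT = TEMPLATE.format("{}\nSummary above sentence in one word: ")
IMAGE_PROMPT = TEMPLATE.format("<image>\nSummary above image in one word: ")

@torch.no_grad()
def embed_text(sentence: str) -> torch.Tensor:
    inputs = processor(text=TEXT_PROMPT.format(sentence),
                       return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True, return_dict=True)
    # Final hidden state of the last token is the embedding.
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)

@torch.no_grad()
def embed_image(image: Image.Image) -> torch.Tensor:
    inputs = processor(text=IMAGE_PROMPT, images=image,
                       return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True, return_dict=True)
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)

# Text-image retrieval then reduces to a dot product of unit vectors:
# score = embed_text("a cat on a sofa") @ embed_image(Image.open("cat.jpg")).T
```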
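The single-modality training objective can be illustrated with a standard in-batch contrastive (InfoNCE) loss over embeddings of paired sentences. This is a sketch under the assumption of NLI-style text pairs; the temperature value and the absence of hard negatives are illustrative choices, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over text-pair embeddings.

    anchor, positive: (batch, dim) embeddings of paired sentences,
    each produced with the same "one word" prompt shown above.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    # (batch, batch) similarity matrix; diagonal entries are the true pairs,
    # all other entries in a row serve as in-batch negatives.
    logits = anchor @ positive.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```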
The paper also discusses the limitations of previous approaches, such as CLIP, and highlights the advantages of E5-V in handling interleaved visual and language inputs. E5-V shows competitive performance on all tasks, often surpassing state-of-the-art models, and provides a more efficient and effective solution for universal multimodal embedding tasks.