E5-V: Universal Embeddings with Multimodal Large Language Models

17 Jul 2024 | Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, Fuzhen Zhuang
E5-V is a universal multimodal embedding model built on multimodal large language models (MLLMs) that can represent interleaved visual and language inputs. It uses a prompt-based representation method to unify multimodal embeddings into the same space, bridging the modality gap even without fine-tuning.

E5-V is trained on text pairs only (single-modality training), which significantly reduces training cost and removes the need to collect multimodal training data. Despite this, it achieves strong performance on text-image retrieval, composed image retrieval, image-image retrieval, and sentence embedding tasks, and it is particularly effective on tasks that require understanding interleaved visual and language inputs. Extensive experiments across these tasks show performance that is competitive with or superior to state-of-the-art models.

The key contributions of E5-V are: a new framework for universal multimodal embeddings with MLLMs; single-modality training that lowers cost and eliminates multimodal data collection; and strong results across diverse tasks without any multimodal training data. The model can also follow zero-shot instructions, representing inputs according to detailed prompts.
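To make the prompt-based unification concrete, below is a minimal sketch of how such an embedding could be extracted from an off-the-shelf MLLM with Hugging Face transformers. The checkpoint id, the exact prompt wording, and the choice of the last token's final hidden state are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch (not the authors' exact setup): both text and images are
# asked to be summarized "in one word" by an MLLM, and the final hidden state
# of the last prompt token is taken as the embedding.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"  # placeholder MLLM checkpoint
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


def embed(prompt: str, image=None) -> torch.Tensor:
    """Encode a prompt (optionally with an image) and return the unit-normalized
    final hidden state of the last prompt token."""
    # In practice the model's chat template should be applied around the prompt.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)  # cast pixel_values to fp16
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    emb = out.hidden_states[-1][:, -1, :]  # last layer, last token
    return F.normalize(emb.float(), dim=-1)


# Analogous prompts map both modalities into the same embedding space,
# so retrieval reduces to cosine similarity. "dog.jpg" is a hypothetical file.
text_emb = embed("A dog playing in the snow.\nSummary of the above sentence in one word:")
image_emb = embed("<image>\nSummary of the above image in one word:",
                  image=Image.open("dog.jpg"))
print(float(text_emb @ image_emb.T))
```

The intuition behind taking the last token is that a "summarize in one word" prompt forces the next-token position to condense the input's meaning, which makes its hidden state a reasonable single-vector representation regardless of whether the input was text or an image.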