6 Jun 2024 | Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
VISTA: Visualized Text Embedding for Universal Multi-Modal Retrieval
VISTA is a new embedding model for universal multi-modal retrieval. It introduces a flexible architecture that extends a powerful text encoder with image-understanding capability through visual token embeddings. It also develops two data generation strategies to produce high-quality composed image-text data for training. In addition, it introduces a multi-stage training algorithm that first aligns the visual token embeddings with the text encoder using weakly labeled data, and then develops multi-modal representation capability using the generated data. VISTA achieves superior performance across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. The model, data, and source code are available at https://github.com/FlagOpen/FlagEmbedding.
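To make the architectural idea concrete, here is a minimal PyTorch sketch of how a text encoder can be extended with visual token embeddings in the spirit of VISTA. The module names, dimensions, toy patch-embedding backbone, and CLS-style pooling are illustrative assumptions, not the authors' implementation, which builds on pre-trained text and image encoders.

```python
# A minimal, illustrative sketch (not the authors' code): a ViT-style image
# backbone produces patch features, which are projected into the text
# encoder's embedding space and consumed as additional "visual tokens".
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPatchEmbed(nn.Module):
    """Stand-in for a ViT backbone: patchify with a conv, output (batch, n_patches, dim)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(images).flatten(2).transpose(1, 2)


class VisualizedTextEncoder(nn.Module):
    """Extends a text transformer so it can encode text, images, or both."""

    def __init__(self, text_encoder: nn.Module, image_encoder: nn.Module,
                 image_dim: int, text_dim: int):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        # Projection so visual patch features live in the text token embedding space.
        self.visual_proj = nn.Linear(image_dim, text_dim)

    def forward(self, text_embeds: torch.Tensor,
                images: Optional[torch.Tensor] = None) -> torch.Tensor:
        # text_embeds: (batch, seq_len, text_dim) token embeddings.
        if images is not None:
            visual_tokens = self.visual_proj(self.image_encoder(images))
            # Prepend visual tokens so the text transformer attends over both modalities.
            text_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        hidden = self.text_encoder(text_embeds)           # (batch, total_len, text_dim)
        return F.normalize(hidden[:, 0], dim=-1)          # CLS-style pooled embedding


# Toy usage with stand-in encoders (shapes only; a real setup would reuse
# pre-trained weights for both encoders).
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
model = VisualizedTextEncoder(nn.TransformerEncoder(layer, num_layers=2),
                              ToyPatchEmbed(320), image_dim=320, text_dim=256)
embedding = model(torch.randn(2, 12, 256), torch.randn(2, 3, 224, 224))
print(embedding.shape)  # torch.Size([2, 256])
```

Because the same encoder handles text-only, image-only, and composed image-text inputs, a single embedding space can serve all of the retrieval settings described above.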
The paper makes three technical contributions: a flexible model architecture that enables the generation of multi-modal embeddings, two data generation strategies for creating high-quality composed image-text data, and a two-stage training algorithm that first aligns the visual token embeddings with the text encoder and then develops multi-modal representation capability. VISTA is evaluated on a range of benchmarks and achieves state-of-the-art performance in both zero-shot and supervised settings, outperforming or matching leading approaches on multiple downstream tasks. Training combines cross-modal alignment with multi-modal fine-tuning, the latter relying on the generated data. The results show that VISTA attains robust multi-modal embedding capability and demonstrate the effectiveness of the generated data. The paper also discusses limitations, including the limited diversity of image styles in the generated data and the handling of image tokens. The model is intended for multi-modal retrieval and should not be applied to sensitive content. The research is supported by national science and technology projects and the Natural Science Foundation of China.
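The two-stage recipe can be pictured as reusing one in-batch contrastive objective over different pairings. The sketch below is a hedged illustration of that idea; the temperature value, the pairing scheme in the comments, and the assumption that both stages share the same loss are ours for illustration, not details taken from the paper.

```python
# Hedged sketch of an in-batch contrastive (InfoNCE-style) objective that a
# two-stage training recipe could reuse across both stages.
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              target_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """The i-th query should match the i-th target; all other targets in the
    batch act as negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    logits = query_emb @ target_emb.t() / temperature        # (batch, batch)
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)


# Stage 1 (cross-modal alignment on weakly labeled image-caption pairs):
#   query = encoder(image only), target = encoder(caption text),
#   which pulls the projected visual tokens toward the text encoder's space.
# Stage 2 (multi-modal fine-tuning on the generated composed data):
#   query = encoder(image + text), target = encoder(target text/passage),
#   which teaches the model joint image-text query embeddings.
```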