VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
**Authors:** Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
**Institutional Affiliations:** State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; Beijing Academy of Artificial Intelligence; The Hong Kong Polytechnic University
**Abstract:**
Multi-modal retrieval is gaining popularity, but existing retrievers are primarily text-oriented and cannot process visual information. Vision-language models such as CLIP are available, but they fall short at representing text-only and image-only data. This paper introduces VISTA, a new embedding model for universal multi-modal retrieval. VISTA makes three key contributions:
1. A flexible architecture that extends a powerful text encoder with visual-to-text embeddings (sketched in code after this list).
2. Two data generation strategies to create high-quality composed image-text pairs for training.
3. A multi-stage training algorithm that first aligns visual tokens with the text encoder using weakly labeled data and then develops multi-modal representation capabilities using generated image-text pairs.
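A minimal PyTorch sketch of the visual-to-text fusion idea from contribution 1 follows. The class name `VisualizedEmbedder`, the embedding dimensions, the Hugging Face style `inputs_embeds` interface, and the mean pooling are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualizedEmbedder(nn.Module):
    """Illustrative sketch: a ViT acts as an image tokenizer whose patch
    embeddings are projected into the text encoder's token space and encoded
    jointly with the text tokens, yielding one embedding per input."""

    def __init__(self, vit, text_encoder, vit_dim=1024, text_dim=768):
        super().__init__()
        self.vit = vit                    # assumed: images -> patch embeddings [B, P, vit_dim]
        self.text_encoder = text_encoder  # assumed: HF-style encoder accepting inputs_embeds
        self.proj = nn.Linear(vit_dim, text_dim)  # visual-to-text embedding projection

    def encode(self, images=None, input_ids=None, attention_mask=None):
        parts, masks = [], []
        if input_ids is not None:         # text-only or composed image-text input
            txt = self.text_encoder.get_input_embeddings()(input_ids)  # [B, L, text_dim]
            parts.append(txt)
            masks.append(attention_mask)
        if images is not None:            # image-only or composed image-text input
            vis = self.proj(self.vit(images))                           # [B, P, text_dim]
            parts.append(vis)
            masks.append(torch.ones(vis.shape[:2], dtype=torch.long, device=vis.device))
        seq, mask = torch.cat(parts, dim=1), torch.cat(masks, dim=1)
        hidden = self.text_encoder(inputs_embeds=seq, attention_mask=mask).last_hidden_state
        m = mask.unsqueeze(-1).to(hidden.dtype)   # masked mean pooling (one of several choices)
        return F.normalize((hidden * m).sum(1) / m.sum(1), dim=-1)
```

In this sketch the text branch runs through the text encoder unchanged, so text-only queries reduce to ordinary text embedding; images simply contribute additional token embeddings when they are present.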
**Key Contributions:**
- **Flexible Architecture:** VISTA integrates a powerful text encoder with an image encoder, enabling in-depth fusion of text and image data.
- **Data Generation:** Two innovative pipelines generate large-scale, high-quality composed image-text datasets for training.
- **Multi-Stage Training:** A two-stage training algorithm enhances VISTA's multi-modal embedding capabilities.
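Both training stages described above can be read as standard in-batch contrastive learning; the sketch below shows an InfoNCE-style loss of the kind such a recipe typically uses. The function name, the temperature value, and the mapping of stages to query/target roles in the docstring are assumptions based on the stage descriptions above, not implementation details from the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, target_emb, temperature=0.05):
    """Generic InfoNCE-style loss over L2-normalized embeddings.

    Stage 1 (alignment):   query = image-only embedding, target = its caption
    embedding from weakly labeled image-text data.
    Stage 2 (multi-modal): query = composed image+text embedding, target = the
    generated text it should retrieve.
    Matching pairs sit on the diagonal; the rest of the batch acts as negatives.
    """
    logits = query_emb @ target_emb.t() / temperature   # [B, B] similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

The loss itself is the same in both stages; what changes is what the query side contains (an image alone versus a composed image-text input) and, plausibly, which parameters are allowed to update.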
**Experimental Results:**
- **Zero-Shot Retrieval:** VISTA achieves superior performance across a variety of multi-modal retrieval tasks without any task-specific fine-tuning.
- **Supervised Fine-Tuning:** With supervised fine-tuning, VISTA outperforms state-of-the-art methods on benchmarks such as WebQA, CIRR, and ReMuQ.
**Conclusion:**
VISTA is a versatile and effective model for universal multi-modal retrieval, demonstrating superior performance in both zero-shot and supervised settings. The model's architecture and training strategies highlight its potential for handling a wide range of multi-modal retrieval tasks.