VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
**Authors:** Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
**Institutional Affiliations:** State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; Beijing Academy of Artificial Intelligence; The Hong Kong Polytechnic University
**Abstract:**
Multi-modal retrieval is gaining popularity, but existing retrievers are primarily text-oriented and cannot process visual information. Vision-language models such as CLIP are available, but they fall short at representing text-only and image-only data. This paper introduces VISTA, a new embedding model for universal multi-modal retrieval. VISTA makes three key contributions:
1. A flexible architecture that extends a powerful text encoder with visual-to-text embeddings (sketched in code after this list).
2. Two data generation strategies to create high-quality composed image-text pairs for training.
3. A multi-stage training algorithm that first aligns visual tokens with the text encoder using weakly labeled data and then develops multi-modal representation capabilities using generated image-text pairs.
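A minimal PyTorch sketch of the visual-to-text fusion idea from contribution 1 follows. The class name `VisualizedEmbedder`, the embedding dimensions, the Hugging Face style `inputs_embeds` interface, and the mean pooling are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualizedEmbedder(nn.Module):
    """Illustrative sketch: a ViT acts as an image tokenizer whose patch
    embeddings are projected into the text encoder's token space and encoded
    jointly with the text tokens, yielding one embedding per input."""

    def __init__(self, vit, text_encoder, vit_dim=1024, text_dim=768):
        super().__init__()
        self.vit = vit                    # assumed: images -> patch embeddings [B, P, vit_dim]
        self.text_encoder = text_encoder  # assumed: HF-style encoder accepting inputs_embeds
        self.proj = nn.Linear(vit_dim, text_dim)  # visual-to-text embedding projection

    def encode(self, images=None, input_ids=None, attention_mask=None):
        parts, masks = [], []
        if input_ids is not None:         # text-only or composed image-text input
            txt = self.text_encoder.get_input_embeddings()(input_ids)  # [B, L, text_dim]
            parts.append(txt)
            masks.append(attention_mask)
        if images is not None:            # image-only or composed image-text input
            vis = self.proj(self.vit(images))                           # [B, P, text_dim]
            parts.append(vis)
            masks.append(torch.ones(vis.shape[:2], dtype=torch.long, device=vis.device))
        seq, mask = torch.cat(parts, dim=1), torch.cat(masks, dim=1)
        hidden = self.text_encoder(inputs_embeds=seq, attention_mask=mask).last_hidden_state
        m = mask.unsqueeze(-1).to(hidden.dtype)   # masked mean pooling (one of several choices)
        return F.normalize((hidden * m).sum(1) / m.sum(1), dim=-1)
```

In this sketch the text branch runs through the text encoder unchanged, so text-only queries reduce to ordinary text embedding; images simply contribute additional token embeddings when they are present.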
**Key Contributions:**
- **Flexible Architecture:** VISTA integrates a powerful text encoder with an image encoder, enabling in-depth fusion of text and image data.
- **Data Generation:** Two innovative pipelines generate large-scale, high-quality composed image-text datasets for training.
- **Multi-Stage Training:** A two-stage training algorithm enhances VISTA's multi-modal embedding capabilities.
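Both training stages described above can be read as standard in-batch contrastive learning; the sketch below shows an InfoNCE-style loss of the kind such a recipe typically uses. The function name, the temperature value, and the mapping of stages to query/target roles in the docstring are assumptions based on the stage descriptions above, not implementation details from the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, target_emb, temperature=0.05):
    """Generic InfoNCE-style loss over L2-normalized embeddings.

    Stage 1 (alignment):   query = image-only embedding, target = its caption
    embedding from weakly labeled image-text data.
    Stage 2 (multi-modal): query = composed image+text embedding, target = the
    generated text it should retrieve.
    Matching pairs sit on the diagonal; the rest of the batch acts as negatives.
    """
    logits = query_emb @ target_emb.t() / temperature   # [B, B] similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

The loss itself is the same in both stages; what changes is what the query side contains (an image alone versus a composed image-text input) and, plausibly, which parameters are allowed to update.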
**Experimental Results:**
- **Zero-Shot Retrieval:** VISTA achieves superior performance across a variety of multi-modal retrieval tasks without any task-specific fine-tuning.
- **Supervised Fine-Tuning:** With supervised fine-tuning, VISTA outperforms state-of-the-art methods on benchmarks such as WebQA, CIRR, and ReMuQ.
**Conclusion:**
VISTA is a versatile and effective model for universal multi-modal retrieval, demonstrating superior performance in both zero-shot and supervised settings. The model's architecture and training strategies highlight its potential for handling a wide range of multi-modal retrieval tasks.