This paper introduces a novel generative cross-modal retrieval framework, GRACE, which enables multimodal large language models (MLLMs) to memorize and recall images within their parameters. The framework assigns unique identifier strings to images and involves two training steps: learning to memorize and learning to retrieve. The first step trains the MLLM to memorize the association between images and their respective identifiers, while the second step teaches the MLLM to generate the corresponding identifier of the target image given a textual query. GRACE introduces a new paradigm for cross-modal retrieval, distinct from previous discriminative approaches. The experiments demonstrate that the generative paradigm performs effectively and efficiently even with large-scale image candidate sets.
GRACE assigns images unique identifiers, where each identifier is a distinct string representing an image. Based on these identifiers, GRACE comprises two training steps. The first step trains the MLLM to memorize the association between images and their respective identifiers. The second step teaches the MLLM to generate the identifier string of the target image given a textual query. By memorizing images in MLLMs, GRACE introduces a new paradigm for cross-modal retrieval, distinct from previous discriminative approaches. The experiments demonstrate that the generative paradigm performs effectively and efficiently even with large-scale image candidate sets.
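As an illustration, the two training steps can be sketched as the construction of two sets of (input, target) pairs. The identifier scheme, field names, and toy corpus below are hypothetical, not taken from the paper:

```python
# Hypothetical sketch of GRACE's two training steps as (input, target) pairs.
# The zero-padded numeric identifier scheme and the toy corpus are
# illustrative assumptions, not the paper's actual data format.

def assign_identifiers(image_ids):
    """Assign each image a unique, distinct string identifier."""
    return {img: f"{i:04d}" for i, img in enumerate(image_ids)}

def build_memorize_examples(identifiers):
    """Step 1 (learning to memorize): image -> its identifier string."""
    return [{"input": f"<image:{img}>", "target": ident}
            for img, ident in identifiers.items()]

def build_retrieve_examples(identifiers, captions):
    """Step 2 (learning to retrieve): textual query -> target image's identifier."""
    return [{"input": caption, "target": identifiers[img]}
            for img, caption in captions.items()]

images = ["cat.jpg", "dog.jpg", "car.jpg"]
ids = assign_identifiers(images)
memorize = build_memorize_examples(ids)
retrieve = build_retrieve_examples(ids, {"cat.jpg": "a cat on a sofa",
                                         "dog.jpg": "a dog in the park",
                                         "car.jpg": "a red car"})
```

At inference time, only the second mapping is exercised: the model sees a textual query and emits an identifier string, with no image features involved.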
GRACE is evaluated on text-image matching datasets to verify the feasibility of generative cross-modal retrieval. Without access to any image's visual information during inference, GRACE performs comparably to advanced two-tower approaches (e.g., CLIP) and demonstrates higher efficiency as the image candidate set grows. It is acknowledged that, as a new retrieval paradigm, GRACE still lags behind one-tower approaches. However, one-tower approaches are only applicable to the ranking stage due to their low efficiency, whereas GRACE and CLIP are specifically designed for the retrieval stage. Through comprehensive analysis, we hope to understand the capabilities and limitations of this paradigm.
We believe exploring generative cross-modal retrieval holds great significance. Benefiting from the inbuilt visual memory within MLLMs, GRACE introduces a new paradigm for cross-modal retrieval. GRACE transforms the original matching problem into a generation problem, eliminating the need for negative samples during training and for a retrieval index during inference. Regardless of the size of the image set, retrieval efficiency remains constant. This new cross-modal retrieval paradigm leaves much room for investigation.
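A minimal sketch of why no similarity index is needed: the model emits an identifier token by token, constrained to valid continuations by a prefix trie over the identifier set, so decoding cost scales with identifier length rather than with the number of candidate images. The scoring function below is a toy stand-in for the MLLM's next-token distribution; the trie construction and decode loop are illustrative assumptions, not the paper's implementation:

```python
# Constrained greedy decoding over a prefix trie of valid identifiers.
# The trie guarantees that only strings corresponding to real images can
# be generated. Assumes no identifier is a prefix of another (true for
# fixed-length identifiers).

def build_trie(identifiers):
    trie = {}
    for ident in identifiers:
        node = trie
        for ch in ident:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-identifier marker
    return trie

def constrained_greedy_decode(score, trie):
    """Greedily pick the best-scoring character among valid continuations."""
    out, node = "", trie
    while "$" not in node:  # stop once a complete identifier is formed
        best = max(node, key=lambda ch: score(out, ch))
        out += best
        node = node[best]
    return out

trie = build_trie(["0001", "0002", "0110"])
# Toy scorer standing in for the MLLM: prefers '1' at position 2, else '0'.
score = lambda prefix, ch: (1.0 if (len(prefix) == 1 and ch == "1")
                            else 0.5 if ch == "0" else 0.1)
result = constrained_greedy_decode(score, trie)  # -> "0110"
```

In a full system this would be beam search rather than greedy decoding, but the key property is the same: each decoding step only consults the trie node's children, independent of how many images have been memorized.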
Inbuilt visual memory serves retrieval, yet its utility extends beyond mere retrieval. In Section 4.5, we demonstrate that the MLLM could describe a memorized image and even answer questions about memorized images, just like humans do. This opens up the possibility of injecting personalized visual experiences into MLLMs so that they can memorize and understand an individual's journey and accomplish more visual tasks.