21 Mar 2024 | Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
MyVLM: Personalizing VLMs for User-Specific Queries
This paper introduces MyVLM, a method for personalizing vision-language models (VLMs) so they can understand and reason over user-specific concepts. The goal is to enable a VLM to generate personalized captions and answer questions about a specific object or individual given only a few images of that concept. The approach augments the VLM with external concept heads trained to recognize the user-specific concept; when a head detects the concept in an image, a learned concept embedding is injected into the language model's input, guiding it to incorporate the concept into its responses.
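To make the mechanism concrete, below is a minimal, illustrative PyTorch sketch of the two pieces described above: a lightweight concept head that classifies whether the concept appears in the frozen vision encoder's image features, and a learned concept embedding appended to the language model's input sequence when the head fires. The names (ConceptHead, personalize_inputs), dimensions, and gating rule are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of the MyVLM idea (not the authors' code).
# Assumes a frozen VLM whose vision encoder yields pooled image features of
# size feat_dim and whose language model consumes token embeddings of size
# embed_dim. Only the concept head and the concept embedding would be trained.
import torch
import torch.nn as nn


class ConceptHead(nn.Module):
    """Lightweight binary classifier that detects one user-specific concept
    from the frozen vision encoder's pooled image features."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, feat_dim) -> (batch, 1) concept logit
        return self.classifier(image_features)


def personalize_inputs(token_embeds: torch.Tensor,
                       concept_embed: torch.Tensor,
                       concept_logits: torch.Tensor) -> torch.Tensor:
    """Append the learned concept embedding to the language model's input
    sequence for samples where the concept head detected the concept."""
    batch, _, embed_dim = token_embeds.shape
    expanded = concept_embed.expand(batch, 1, embed_dim)
    # Zero out the concept token for images where the concept was not detected.
    gate = (torch.sigmoid(concept_logits) > 0.5).float().view(batch, 1, 1)
    return torch.cat([token_embeds, expanded * gate], dim=1)


if __name__ == "__main__":
    feat_dim, embed_dim, batch = 768, 4096, 2
    head = ConceptHead(feat_dim)
    concept_embed = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)

    image_features = torch.randn(batch, feat_dim)     # from the frozen vision encoder
    token_embeds = torch.randn(batch, 16, embed_dim)  # prompt / visual token embeddings

    logits = head(image_features)                     # (batch, 1) detection logits
    lm_inputs = personalize_inputs(token_embeds, concept_embed, logits)
    print(lm_inputs.shape)  # torch.Size([2, 17, 4096])
```

In this sketch the VLM itself stays frozen; only the small concept head and the concept embedding would be optimized from the handful of concept images, which is what keeps the personalization lightweight.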
The method is applied to two popular VLMs, BLIP-2 and LLaVA, and shown to be effective for personalized image captioning and visual question answering. Once personalized, the VLM can recognize and contextualize a user-specific concept even when it appears in new images. The approach is evaluated on a new dataset of various objects and individuals, and the results show that MyVLM generalizes to new instances of previously learned concepts while preserving the model's behavior on unrelated inputs.
The paper also discusses the limitations of MyVLM, including its reliance on the VLM's inherent biases and the quality of the concept heads. It suggests that further research is needed to improve the robustness of the method, particularly in handling images with many individuals and avoiding context leakage. Overall, MyVLM offers a promising approach to personalizing VLMs, enabling more meaningful human-computer interactions by allowing models to understand and reason over user-specific concepts.