Yo'LLaVA: Your Personalized Language and Vision Assistant


13 Jun 2024 | Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee
Yo'LLaVA is a personalized Large Multimodal Model (LMM) that learns to converse about and recognize a specific subject from just a few images. Unlike generic LMMs, which cannot handle user-specific subjects, Yo'LLaVA embeds the personalized subject into a small set of latent tokens via a learnable prompt. This lets the model answer questions about the subject, recognize it in new images, and hold natural conversations about it without needing reference images at query time.

The key contributions of Yo'LLaVA are:

1. **Personalized LMMs**: Introduces the novel task of personalizing LMMs to adapt to and answer questions about user-specific concepts.
2. **Efficient Framework**: Learns personalized concepts from only a few images while retaining the model's broad pre-trained knowledge.
3. **Training Dataset**: Creates a new dataset specifically designed for personalizing LMMs.
4. **Open Source**: Training and evaluation data, code, and models will be released.

Yo'LLaVA addresses two main challenges (illustrative sketches of both points appear below):

1. **Efficient Learning**: Keeps the model's broad pre-trained knowledge intact by optimizing only a small set of learnable input tokens.
2. **Fine-Grained Visual Details**: Captures detailed visual attributes of the personalized subject through hard negative mining.
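The following minimal sketch illustrates the first point under stated assumptions: only a handful of concept-token embeddings are trained, while everything carrying the pre-trained knowledge stays frozen. This is not the authors' code; the tiny `TransformerEncoder` is a stand-in for the real frozen multimodal backbone, and the vocabulary size, token count, and placeholder data are illustrative assumptions.

```python
# Sketch: personalize a frozen model by training only a few latent "concept" tokens.
import torch
import torch.nn as nn

VOCAB, DIM, N_CONCEPT_TOKENS = 1000, 256, 4   # hypothetical sizes, not from the paper

class PersonalizedPrompt(nn.Module):
    def __init__(self, frozen_lm: nn.Module, word_emb: nn.Embedding):
        super().__init__()
        self.lm = frozen_lm
        self.word_emb = word_emb
        self.head = nn.Linear(DIM, VOCAB, bias=False)
        # Freeze everything that carries the pre-trained knowledge.
        for module in (self.lm, self.word_emb, self.head):
            for p in module.parameters():
                p.requires_grad = False
        # The only trainable parameters: latent tokens representing the new subject.
        self.concept = nn.Parameter(torch.randn(N_CONCEPT_TOKENS, DIM) * 0.02)

    def forward(self, prompt_ids, answer_ids):
        prompt = self.word_emb(prompt_ids)                         # (B, Lp, D)
        answer = self.word_emb(answer_ids)                         # (B, La, D)
        concept = self.concept.unsqueeze(0).expand(prompt.size(0), -1, -1)
        x = torch.cat([concept, prompt, answer], dim=1)            # prepend concept tokens
        h = self.lm(x)                                             # (B, L, D)
        # Predict each answer token from the hidden state just before it.
        logits = self.head(h[:, -answer_ids.size(1) - 1:-1])
        return nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), answer_ids.reshape(-1))

# Toy frozen backbone standing in for the real multimodal model.
layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
model = PersonalizedPrompt(nn.TransformerEncoder(layer, num_layers=2),
                           nn.Embedding(VOCAB, DIM))
optimizer = torch.optim.AdamW([model.concept], lr=1e-3)    # only the concept tokens

prompt_ids = torch.randint(0, VOCAB, (2, 16))   # placeholder question tokens
answer_ids = torch.randint(0, VOCAB, (2, 4))    # placeholder answer tokens
loss = model(prompt_ids, answer_ids)
loss.backward()
optimizer.step()
```

Because gradients flow only into the concept embeddings, the backbone's behavior on everything unrelated to the subject is untouched, which is what allows the personalized model to keep its broad pre-trained knowledge.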
Experiments show that Yo'LLaVA outperforms strong prompting baselines (e.g., GPT-4 and LLaVA) at recognizing personalized subjects and at question answering about them. The model recognizes the subject in new images and answers questions about its visual characteristics with high accuracy, even without reference images, while being more token-efficient than the prompting-based alternatives. The paper also includes ablation studies on the number of trainable tokens and training images, a comparison with the concurrent work MyVLM, and a discussion of broader impacts, including potential risks such as hallucination and bias and strategies to mitigate them.
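As a concrete illustration of the hard negative mining mentioned among the challenges above, the sketch below shows one plausible way to build recognition training examples: rank a candidate pool by visual similarity to the subject's reference images and label the most similar non-subject images as "No" answers. The encoder, candidate pool, question wording, and pool sizes are assumptions for illustration; the summary only states that hard negative mining is used.

```python
# Sketch: mine visually similar hard negatives for recognition-style training data.
import torch

def embed_images(images: torch.Tensor) -> torch.Tensor:
    """Placeholder for a pretrained image encoder (e.g. a CLIP-style model)."""
    return torch.nn.functional.normalize(images.flatten(1), dim=-1)

def mine_hard_negatives(reference_imgs, candidate_pool, k=16):
    """Return the k candidates most similar to the subject's reference images."""
    ref = embed_images(reference_imgs).mean(0, keepdim=True)   # subject prototype
    cand = embed_images(candidate_pool)
    sims = (cand @ ref.T).squeeze(-1)                          # cosine similarity
    topk = sims.topk(min(k, len(candidate_pool))).indices
    return candidate_pool[topk]

def build_recognition_examples(positives, negatives, name="<sks>"):
    """Pair each image with a yes/no recognition question about the subject."""
    q = f"Can you see {name} in this photo? Answer with Yes or No."
    data = [(img, q, "Yes") for img in positives]
    data += [(img, q, "No") for img in negatives]
    return data

refs = torch.rand(5, 3, 224, 224)        # a few reference images of the subject
pool = torch.rand(64, 3, 224, 224)       # unlabeled candidate images
hard_negs = mine_hard_negatives(refs, pool, k=16)
examples = build_recognition_examples(list(refs), list(hard_negs))
```

Training on negatives that look similar to the subject, rather than random images, is what pushes the learned tokens to encode fine-grained visual attributes instead of coarse category cues.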