Yo'LLaVA: Your Personalized Language and Vision Assistant

13 Jun 2024 | Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee
Yo'LLaVA is a personalized large multimodal model (LMM) that enables conversations about specific, user-provided subjects. Unlike general LMMs, which lack personalized knowledge, Yo'LLaVA learns to embed a personalized subject into a small set of latent tokens from only a few example images. This learnable prompt represents the subject during generation, allowing the model to recognize it and answer questions about it even without additional context in the query.

The model is trained with a combination of positive and negative examples so that it can distinguish the target subject from visually similar ones, and it is designed to retain its pre-trained knowledge while acquiring the new personalized information. Built upon the LLaVA framework, it handles personalized queries such as recognizing a user's pet dog or a specific person.

The approach is evaluated on a dataset of 40 subjects, including people, pets, landmarks, objects, and fictional characters, on recognition and question-answering tasks. Experiments show that Yo'LLaVA achieves high accuracy on both tasks, outperforms strong prompting baselines in efficiency and effectiveness, and compares favorably with prior work such as MyVLM in accuracy and efficiency. The method is lightweight, applicable to areas such as health and wellness, education, and entertainment, and is released as open source to support further research. Its ability to recognize and converse about personalized subjects is a significant advancement for personalized LMMs.
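The core idea described above, representing a subject as a handful of learnable soft tokens prepended to the language model's input while the backbone stays frozen, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names (`PersonalizedTokens`, `training_step`, `frozen_lmm`), the number of latent tokens, and the HuggingFace-style `inputs_embeds`/`labels` call are assumptions made for the sketch.

```python
# Minimal sketch (assumptions, not the paper's code): learn a few soft-token
# embeddings for one new subject while the pre-trained LMM stays frozen.
import torch
import torch.nn as nn


class PersonalizedTokens(nn.Module):
    """k learnable latent embeddings that stand in for one subject (e.g. <sks>)."""

    def __init__(self, k: int = 16, dim: int = 4096):
        super().__init__()
        # Randomly initialized; only these parameters are updated during training.
        self.latents = nn.Parameter(torch.randn(k, dim) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        # Repeat the same latent prompt for every example in the batch.
        return self.latents.unsqueeze(0).expand(batch_size, -1, -1)


def training_step(frozen_lmm, subject_tokens, image_feats, text_embeds, labels):
    """One step: prepend the subject's latent tokens to the usual (image, text)
    embedding sequence and optimize only those tokens with the standard
    next-token loss. Batches mix positive images of the subject with negative
    images of similar-looking subjects so the tokens become discriminative."""
    b = image_feats.size(0)
    prompt = subject_tokens(b)                                    # (b, k, dim)
    inputs = torch.cat([prompt, image_feats, text_embeds], dim=1)
    out = frozen_lmm(inputs_embeds=inputs, labels=labels)         # backbone frozen
    return out.loss


# The optimizer touches only the new embeddings, which keeps training lightweight
# and leaves the backbone's pre-trained knowledge untouched.
# subject = PersonalizedTokens()
# optimizer = torch.optim.AdamW(subject.parameters(), lr=1e-3)
```

Because only a few embedding vectors are optimized, personalizing a new subject is cheap compared with fine-tuning the model, which is consistent with the paper's emphasis on efficiency relative to prompting baselines and MyVLM.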