Visual Instruction Tuning


11 Dec 2023 | Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
This paper introduces LLaVA, a large multimodal model that connects a vision encoder with an LLM to enable general-purpose visual and language understanding. The model is trained with visual instruction tuning: language-only GPT-4 is used to generate a diverse set of multimodal instruction-following examples, and LLaVA is then fine-tuned on this data. The paper also presents LLaVA-Bench, two evaluation benchmarks covering diverse and challenging application-oriented tasks.

Experiments show that LLaVA exhibits impressive multimodal chat abilities, sometimes matching the behavior of multimodal GPT-4 on unseen images and instructions, and achieves an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on ScienceQA, the synergy of LLaVA and GPT-4 reaches a new state-of-the-art accuracy of 92.53%. The generated instruction data, codebase, and model checkpoints are publicly available. The paper further discusses the model's limitations, analyzes its performance on challenging tasks, and concludes that visual instruction tuning is an effective approach for building multimodal models that follow human instructions and complete visual tasks, and that combining models such as LLaVA and GPT-4 helps achieve state-of-the-art results on multimodal tasks.
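To make the "connects a vision encoder with an LLM" idea concrete, here is a minimal PyTorch sketch of that connection: image patch features are mapped into the LLM's token-embedding space by a trainable projection and prepended to the embedded instruction tokens. The class name and the dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: project frozen vision-encoder features into the LLM's
# embedding space and concatenate them with the instruction embeddings.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Trainable projection W: vision features -> LLM token embeddings.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeddings: (batch, num_text_tokens, llm_dim) from the LLM embedder
        visual_tokens = self.projection(image_features)
        # Prepend visual tokens so the LLM conditions on the image before the prompt.
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Toy usage with random tensors standing in for real encoder outputs.
connector = VisionLanguageConnector()
fake_image_features = torch.randn(1, 256, 1024)   # e.g. 256 patch embeddings
fake_text_embeddings = torch.randn(1, 32, 4096)   # e.g. 32 instruction tokens
llm_input = connector(fake_image_features, fake_text_embeddings)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```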
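The 85.1% figure is a relative score judged against GPT-4 answers. The sketch below shows one simplified way such a metric can be computed, assuming per-question judge ratings are already available; the function name and the example scores are hypothetical placeholders, not values from the paper.

```python
# Simplified sketch of a relative-score metric: a judge rates the candidate
# model's answer and a GPT-4 reference answer for each question, and the
# relative score is the ratio of the two totals, expressed as a percentage.
from typing import Sequence

def relative_score(candidate_scores: Sequence[float],
                   reference_scores: Sequence[float]) -> float:
    """Return the candidate's total judge score as a percentage of the reference's."""
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("score lists must align question-by-question")
    return 100.0 * sum(candidate_scores) / sum(reference_scores)

# Hypothetical per-question judge ratings on a 1-10 scale.
llava_scores = [8, 7, 9, 6, 8]
gpt4_scores  = [9, 8, 9, 8, 9]
print(f"relative score: {relative_score(llava_scores, gpt4_scores):.1f}%")
```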