Visual Instruction Tuning


11 Dec 2023 | Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
This paper introduces LLaVA, a large multimodal model that connects a vision encoder with an LLM to enable general-purpose visual and language understanding. The model is trained with visual instruction tuning: language-only GPT-4 is used to generate a diverse set of multimodal instruction-following examples, and LLaVA is then fine-tuned on this data. The paper also presents LLaVA-Bench, two evaluation benchmarks covering diverse and challenging application-oriented tasks.

Experiments show that LLaVA exhibits impressive multimodal chat abilities, sometimes matching the behavior of multimodal GPT-4 on unseen images and instructions, and achieves an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on ScienceQA, the synergy of LLaVA and GPT-4 reaches a new state-of-the-art accuracy of 92.53%. The generated instruction data, codebase, and model checkpoints are publicly available. The paper further discusses the model's limitations, analyzes its performance on challenging tasks, and concludes that visual instruction tuning is an effective approach for building multimodal models that follow human instructions and complete visual tasks, and that combining models such as LLaVA and GPT-4 helps achieve state-of-the-art results on multimodal tasks.
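To make the "connects a vision encoder with an LLM" idea concrete, here is a minimal PyTorch sketch of that connection: image patch features are mapped into the LLM's token-embedding space by a trainable projection and prepended to the embedded instruction tokens. The class name and the dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: project frozen vision-encoder features into the LLM's
# embedding space and concatenate them with the instruction embeddings.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Trainable projection W: vision features -> LLM token embeddings.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeddings: (batch, num_text_tokens, llm_dim) from the LLM embedder
        visual_tokens = self.projection(image_features)
        # Prepend visual tokens so the LLM conditions on the image before the prompt.
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Toy usage with random tensors standing in for real encoder outputs.
connector = VisionLanguageConnector()
fake_image_features = torch.randn(1, 256, 1024)   # e.g. 256 patch embeddings
fake_text_embeddings = torch.randn(1, 32, 4096)   # e.g. 32 instruction tokens
llm_input = connector(fake_image_features, fake_text_embeddings)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```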
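The 85.1% figure is a relative score judged against GPT-4 answers. The sketch below shows one simplified way such a metric can be computed, assuming per-question judge ratings are already available; the function name and the example scores are hypothetical placeholders, not values from the paper.

```python
# Simplified sketch of a relative-score metric: a judge rates the candidate
# model's answer and a GPT-4 reference answer for each question, and the
# relative score is the ratio of the two totals, expressed as a percentage.
from typing import Sequence

def relative_score(candidate_scores: Sequence[float],
                   reference_scores: Sequence[float]) -> float:
    """Return the candidate's total judge score as a percentage of the reference's."""
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("score lists must align question-by-question")
    return 100.0 * sum(candidate_scores) / sum(reference_scores)

# Hypothetical per-question judge ratings on a 1-10 scale.
llava_scores = [8, 7, 9, 6, 8]
gpt4_scores  = [9, 8, 9, 8, 9]
print(f"relative score: {relative_score(llava_scores, gpt4_scores):.1f}%")
```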