Improved Baselines with Visual Instruction Tuning

15 May 2024 | Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
This paper presents a systematic study of the design choices of large multimodal models (LMMs) under the LLaVA framework. The authors show that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, such as using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with response formatting prompts, they establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Their final 13B checkpoint uses only about 1.2M publicly available training samples and completes full training in roughly one day on a single 8-A100 node. They also explore open problems in LMMs, including scaling to higher-resolution inputs, compositional capabilities, and model hallucination. The authors hope this makes state-of-the-art LMM research more accessible; code and models will be publicly available.

The resulting model, LLaVA-1.5, is a simple yet effective approach to balancing multitask learning and effective scaling for large multimodal models. It uses only public data, achieves state-of-the-art results on a broad range of 11 tasks, and is significantly more data-efficient than previous approaches. By rethinking conventional approaches and exploring open problems in visual instruction tuning, the authors pave the way for more robust and capable LMMs, and they hope these improved, easily reproducible baselines will serve as a reference for future research on open-source LMMs.
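To make the two highlighted changes concrete, here is a minimal sketch (not the authors' released code) of a two-layer MLP vision-language projector and a short-answer response formatting prompt. The hidden sizes, the GELU activation, and the patch count are illustrative assumptions chosen to match a CLIP ViT-L/14 encoder at 336px feeding a 13B-scale LLM, not values read from the released checkpoints.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Maps frozen vision-encoder patch features into the LLM's token embedding space.

    LLaVA-1.5 replaces the original single linear projection with a small MLP;
    the exact layer sizes and activation below are assumptions for illustration.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120) -> None:
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)


# Response formatting prompt appended to academic-task VQA questions so the
# model learns to emit short answers rather than conversational paragraphs.
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."


if __name__ == "__main__":
    projector = MLPProjector()
    # 576 patches corresponds to a 336px image with 14px patches (24 x 24).
    dummy_patches = torch.randn(2, 576, 1024)
    tokens = projector(dummy_patches)
    print(tokens.shape)  # torch.Size([2, 576, 5120])
```

The projected patch tokens are concatenated with the text token embeddings before being fed to the language model, so the projector's output width must match the LLM's embedding dimension, which is what the second linear layer enforces in this sketch.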