Improved Baselines with Visual Instruction Tuning

15 May 2024 | Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
This paper presents a systematic study of the design choices of large multimodal models (LMMs) under the LLaVA framework. The authors show that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, such as using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with response formatting prompts, they establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Their final 13B checkpoint uses only about 1.2M publicly available training samples and completes full training in roughly one day on a single 8-A100 node. They also explore open problems in LMMs, including scaling to higher-resolution inputs, compositional capabilities, and model hallucination. The authors hope this makes state-of-the-art LMM research more accessible; code and models will be publicly available.

The resulting model, LLaVA-1.5, is a simple yet effective approach to balancing multitask learning and effective scaling for large multimodal models. It uses only public data, achieves state-of-the-art results on a broad range of 11 tasks, and is significantly more data-efficient than previous approaches. By rethinking conventional approaches and exploring open problems in visual instruction tuning, the authors pave the way for more robust and capable LMMs, and they hope these improved, easily reproducible baselines will serve as a reference for future research on open-source LMMs.
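To make the two highlighted changes concrete, here is a minimal sketch (not the authors' released code) of a two-layer MLP vision-language projector and a short-answer response formatting prompt. The hidden sizes, the GELU activation, and the patch count are illustrative assumptions chosen to match a CLIP ViT-L/14 encoder at 336px feeding a 13B-scale LLM, not values read from the released checkpoints.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Maps frozen vision-encoder patch features into the LLM's token embedding space.

    LLaVA-1.5 replaces the original single linear projection with a small MLP;
    the exact layer sizes and activation below are assumptions for illustration.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120) -> None:
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)


# Response formatting prompt appended to academic-task VQA questions so the
# model learns to emit short answers rather than conversational paragraphs.
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."


if __name__ == "__main__":
    projector = MLPProjector()
    # 576 patches corresponds to a 336px image with 14px patches (24 x 24).
    dummy_patches = torch.randn(2, 576, 1024)
    tokens = projector(dummy_patches)
    print(tokens.shape)  # torch.Size([2, 576, 5120])
```

The projected patch tokens are concatenated with the text token embeddings before being fed to the language model, so the projector's output width must match the LLM's embedding dimension, which is what the second linear layer enforces in this sketch.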