TinyLLaVA: A Framework of Small-scale Large Multimodal Models

22 Feb 2024 | Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, Lei Huang
TinyLLaVA is a framework for small-scale large multimodal models (LMMs). Each model consists of a vision encoder, a small-scale LLM decoder, and an intermediate connector, together with the associated training pipelines. The study empirically investigates how different vision encoders, connectors, language models, training data, and training recipes affect performance, and finds that smaller LMMs can reach performance on par with larger ones when combined with better data and training recipes. The best model, TinyLLaVA-3.1B, outperforms existing 7B models such as LLaVA-1.5 and Qwen-VL. By providing a unified perspective for designing and analyzing small-scale LMMs, the framework offers insight into the LMM design space, and the findings suggest that further work on data scaling, training setups, and model selection can build on these results.
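To make the three-part layout concrete, below is a minimal sketch of how a vision encoder, connector, and small-scale LLM decoder might be composed. It is illustrative only: the class names (MLPConnector, SmallScaleLMM), the two-layer MLP connector, and the PyTorch-style interfaces are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MLPConnector(nn.Module):
    """Projects vision features into the LLM embedding space.
    (Illustrative two-layer MLP; the paper compares several connector choices.)"""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)


class SmallScaleLMM(nn.Module):
    """Sketch of the layout: vision encoder -> connector -> small LLM decoder."""

    def __init__(self, vision_encoder: nn.Module, connector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.connector = connector
        self.llm = llm

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode the image into patch features, project them into the LLM's
        # embedding space, then prepend them to the text embeddings so the
        # decoder attends over both modalities.
        vision_feats = self.vision_encoder(images)            # (B, N_patches, vision_dim)
        vision_tokens = self.connector(vision_feats)          # (B, N_patches, llm_dim)
        inputs = torch.cat([vision_tokens, text_embeds], 1)   # (B, N_patches + T, llm_dim)
        return self.llm(inputs)
```

In this sketch, swapping the vision encoder, connector, or LLM corresponds to the design axes the study varies; the training recipes then decide which of these components are frozen or fine-tuned at each stage.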