TinyLLaVA: A Framework of Small-scale Large Multimodal Models

22 Feb 2024 | Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, Lei Huang
TinyLLaVA is a framework for small-scale large multimodal models (LMMs). Each model consists of a vision encoder, a small-scale LLM decoder, and an intermediate connector, together with the associated training pipelines. The study empirically investigates how different vision encoders, connectors, language models, training data, and training recipes affect performance, and finds that smaller LMMs can reach performance on par with larger ones when combined with better data and training recipes. The best model, TinyLLaVA-3.1B, outperforms existing 7B models such as LLaVA-1.5 and Qwen-VL. By providing a unified perspective for designing and analyzing small-scale LMMs, the framework offers insight into the LMM design space, and the findings suggest that further work on data scaling, training setups, and model selection can build on these results.
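To make the three-part layout concrete, below is a minimal sketch of how a vision encoder, connector, and small-scale LLM decoder might be composed. It is illustrative only: the class names (MLPConnector, SmallScaleLMM), the two-layer MLP connector, and the PyTorch-style interfaces are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MLPConnector(nn.Module):
    """Projects vision features into the LLM embedding space.
    (Illustrative two-layer MLP; the paper compares several connector choices.)"""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)


class SmallScaleLMM(nn.Module):
    """Sketch of the layout: vision encoder -> connector -> small LLM decoder."""

    def __init__(self, vision_encoder: nn.Module, connector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.connector = connector
        self.llm = llm

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode the image into patch features, project them into the LLM's
        # embedding space, then prepend them to the text embeddings so the
        # decoder attends over both modalities.
        vision_feats = self.vision_encoder(images)            # (B, N_patches, vision_dim)
        vision_tokens = self.connector(vision_feats)          # (B, N_patches, llm_dim)
        inputs = torch.cat([vision_tokens, text_embeds], 1)   # (B, N_patches + T, llm_dim)
        return self.llm(inputs)
```

In this sketch, swapping the vision encoder, connector, or LLM corresponds to the design axes the study varies; the training recipes then decide which of these components are frozen or fine-tuned at each stage.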