Efficient Multimodal Large Language Models: A Survey


9 Aug 2024 | Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma
This survey provides a comprehensive review of efficient Multimodal Large Language Models (MLLMs), addressing the high resource demands and computational costs associated with large-scale models. It highlights the importance of reducing resource consumption to broaden the applicability of MLLMs, particularly in edge-computing scenarios, and traces a timeline of representative efficient MLLMs alongside research on efficient structures, strategies, and their applications.

The survey is organized into six categories: architecture, efficient vision, efficient LLMs, training, data and benchmarks, and applications. Key topics include lightweight vision encoders, efficient vision-language projectors, small language models, vision token compression, and efficient structures such as Mixture-of-Experts and Mamba. The paper also explores techniques for efficient attention, including sharing-based attention, feature information reduction, and approximate attention, as well as parameter-efficient fine-tuning and state space models as alternatives to attention mechanisms. The survey aims to provide a roadmap for future research and to highlight the potential of efficient MLLMs across a range of domains.
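To make one of these techniques concrete, below is a minimal sketch of vision token compression, assuming a simple average-pooling scheme: the visual tokens produced by the encoder are pooled down to a fixed budget before being passed to the language model. The token counts, dimensions, and the `compress_vision_tokens` helper are illustrative assumptions, not a specific method from the survey.

```python
# Minimal sketch of vision token compression (illustrative; not a specific
# method from the survey). Visual tokens are average-pooled to a fixed
# budget before being handed to the LLM, shrinking the sequence the
# language model must attend over.
import torch
import torch.nn.functional as F

def compress_vision_tokens(tokens: torch.Tensor, budget: int) -> torch.Tensor:
    """Pool (batch, n_tokens, dim) visual tokens down to (batch, budget, dim)."""
    # adaptive_avg_pool1d expects (batch, channels, length), so move the
    # embedding dimension into the channel position first.
    pooled = F.adaptive_avg_pool1d(tokens.transpose(1, 2), budget)
    return pooled.transpose(1, 2)

# Example: 576 patch tokens (a 24x24 grid) compressed to 144.
vision_tokens = torch.randn(1, 576, 1024)
compressed = compress_vision_tokens(vision_tokens, budget=144)
print(compressed.shape)  # torch.Size([1, 144, 1024])
```

Average pooling is only the simplest option; the efficient vision-language projectors discussed in the survey play a related role, mapping visual features into a compact token sequence for the LLM.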
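Parameter-efficient fine-tuning is another recurring theme. A widely used instance is low-rank adaptation (LoRA), sketched below for a single linear layer; the rank, scaling factor, and `LoRALinear` class are illustrative choices rather than details drawn from the survey.

```python
# Minimal LoRA sketch (a common parameter-efficient fine-tuning method;
# hyperparameters here are illustrative). The frozen pretrained weight W
# is augmented with a trainable low-rank update (alpha / r) * B @ A, so
# only r * (in + out) parameters are trained instead of in * out.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze pretrained weight
        self.base.bias.requires_grad_(False)    # freeze pretrained bias
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(1024, 1024)
y = layer(torch.randn(2, 16, 1024))  # works on (batch, seq, dim) inputs
```

Because the pretrained weights stay frozen and only the low-rank factors are optimized, this style of adaptation is attractive for tuning large MLLMs on limited hardware.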
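The appeal of state space models as attention alternatives comes from their recurrence: the hidden state is updated linearly at each step, so per-token inference cost is constant rather than growing with context length. The sketch below shows a deliberately simplified linear time-invariant SSM; Mamba's selective variant additionally makes the dynamics input-dependent, and all matrices here are illustrative.

```python
# Simplified state space model recurrence (illustrative; Mamba additionally
# makes A, B, C input-dependent and uses a hardware-aware parallel scan).
#   h_t = A @ h_{t-1} + B @ x_t
#   y_t = C @ h_t
# Per-token cost depends only on the state size, not on how many tokens
# came before, unlike self-attention's per-token cost that grows with
# sequence length.
import numpy as np

def ssm_scan(x: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """x: (seq_len, in_dim); returns y: (seq_len, out_dim)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # linear state update
        ys.append(C @ h)      # readout
    return np.stack(ys)

seq = np.random.randn(32, 4)         # 32 steps of 4-dim input
A = 0.9 * np.eye(8)                  # stable state transition
B = np.random.randn(8, 4) * 0.1
C = np.random.randn(2, 8)
print(ssm_scan(seq, A, B, C).shape)  # (32, 2)
```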