6 Jun 2024 | Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
This survey provides a comprehensive overview of recent advances in Multimodal Large Language Models (MLLMs), focusing on their architectures, training methodologies, and applications. MLLMs integrate visual and textual modalities, enabling tasks such as visual understanding, visual grounding, and image generation. The paper analyzes a range of MLLMs, examining their visual encoders and the adapters that facilitate cross-modal alignment. It discusses the reuse of pre-trained components, parameter-efficient fine-tuning techniques, and the role of vision-to-language adapters in connecting visual features to the language model. The survey also covers training data, evaluation benchmarks, and the computational requirements of different MLLMs, highlighting the impact of visual encoders, such as CLIP-based models, on downstream performance. It further examines cross-attention mechanisms and other techniques for injecting visual information into the language model. Finally, it addresses open challenges in MLLM development, including hallucination, harmful generation, and computational efficiency, and concludes with future research directions, emphasizing the need for more robust and efficient MLLMs capable of handling diverse modalities and applications.
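As a concrete illustration of the adapter-based alignment discussed above, the sketch below shows a minimal vision-to-language adapter in PyTorch: a small MLP projects features from a frozen visual encoder into the LLM embedding space, and the resulting visual tokens are prepended to the text embeddings as a soft multimodal prompt. This is a simplified sketch in the spirit of the linear/MLP projectors covered by the survey; the module names, dimensions, and two-layer design are illustrative assumptions, not the implementation of any specific model.

```python
import torch
import torch.nn as nn


class VisionToLanguageAdapter(nn.Module):
    """Maps visual-encoder features into the LLM embedding space.

    A two-layer MLP projector, a common adapter design in MLLMs;
    actual architectures vary per model (illustrative assumption).
    """

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim) from a frozen encoder
        return self.proj(visual_feats)  # (batch, num_patches, llm_dim)


def build_multimodal_prefix(visual_feats, text_embeds, adapter):
    """Prepend projected visual tokens to the text token embeddings.

    The concatenated sequence is then fed to the language model,
    which is typically kept frozen or tuned with parameter-efficient
    methods such as LoRA.
    """
    visual_tokens = adapter(visual_feats)                   # (B, N_img, llm_dim)
    return torch.cat([visual_tokens, text_embeds], dim=1)   # (B, N_img + N_txt, llm_dim)


if __name__ == "__main__":
    # Toy dimensions: a CLIP ViT-L/14 encoder yields 1024-dim patch features;
    # a 7B-scale LLM typically uses a 4096-dim hidden size.
    adapter = VisionToLanguageAdapter(vision_dim=1024, llm_dim=4096)
    visual_feats = torch.randn(2, 256, 1024)   # (batch, patches, vision_dim)
    text_embeds = torch.randn(2, 32, 4096)     # (batch, text tokens, llm_dim)
    fused = build_multimodal_prefix(visual_feats, text_embeds, adapter)
    print(fused.shape)  # torch.Size([2, 288, 4096])
```

Cross-attention-based designs differ from this prefix-style approach: instead of concatenating visual tokens with the text sequence, they insert attention layers inside the LLM that attend to the visual features directly, trading a longer input sequence for additional trainable parameters.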