The Revolution of Multimodal Large Language Models: A Survey

6 Jun 2024 | Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
This paper provides a comprehensive review of Multimodal Large Language Models (MLLMs), focusing on their architectural choices, multimodal alignment strategies, and training techniques. The authors analyze a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. They also compile and describe training datasets and evaluation benchmarks, comparing the performance and computational requirements of existing models. The survey highlights the current state of MLLMs and lays the groundwork for future developments. Key aspects discussed include the architecture of MLLMs, the use of visual encoders and adapters, and the training processes and data utilized. The paper identifies three core aspects of MLLMs: their architecture, training methodologies, and the tasks they are designed to perform. It also discusses the challenges and promising directions for future research, such as multimodal retrieval-augmented generation, correcting hallucinations, preventing harmful and biased generation, and reducing computational load.
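To make the architectural pattern concrete, below is a minimal sketch of the layout the survey describes: a visual encoder produces image features, a lightweight adapter projects them into the language model's embedding space, and the LLM attends over the concatenated visual and text tokens. This is an illustrative assumption of a common (LLaVA-style) design, not the paper's reference implementation; all module names, dimensions, and the toy Transformer stand-in for the LLM are hypothetical.

```python
# Illustrative sketch only: a common MLLM layout (visual encoder -> adapter -> LLM).
# Module names, sizes, and the placeholder Transformer body are assumptions,
# not code from the surveyed models.
import torch
import torch.nn as nn


class VisualAdapter(nn.Module):
    """Two-layer MLP adapter mapping visual features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        return self.proj(visual_feats)


class ToyMLLM(nn.Module):
    """Skeleton multimodal LLM: vision encoder + adapter + language model."""

    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        # Placeholder patch embedder standing in for a pretrained ViT encoder.
        self.vision_encoder = nn.Linear(3 * 14 * 14, vision_dim)
        self.adapter = VisualAdapter(vision_dim, llm_dim)
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Placeholder for a decoder-only LLM (a real model would use causal masking).
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        visual_feats = self.vision_encoder(image_patches)   # (B, P, vision_dim)
        visual_tokens = self.adapter(visual_feats)           # (B, P, llm_dim)
        text_tokens = self.text_embed(text_ids)              # (B, T, llm_dim)
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        hidden = self.llm(sequence)
        return self.lm_head(hidden)                          # next-token logits


if __name__ == "__main__":
    model = ToyMLLM()
    patches = torch.randn(1, 64, 3 * 14 * 14)   # flattened dummy RGB patches
    prompt = torch.randint(0, 1000, (1, 16))    # dummy text token ids
    logits = model(patches, prompt)
    print(logits.shape)  # torch.Size([1, 80, 1000])
```

In this scheme only the adapter (and optionally the LLM) is trained during multimodal alignment, while the visual encoder is typically kept frozen; the surveyed models differ mainly in the choice of encoder, the adapter design, and which components are fine-tuned at each training stage.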