A Review of Multi-Modal Large Language and Vision Models

28 Mar 2024 | Kilian Carolan, Laura Fennelly and Alan F. Smeaton
This paper provides an extensive review of multi-modal large language models (MM-LLMs) and their applications. It traces the evolution of large language models (LLMs) from traditional statistical language models to transformer-based architectures such as BERT and GPT, highlighting the role of attention mechanisms in improving model performance. The review covers the major LLMs and MM-LLMs, including GPT, Claude, Gemini, LLaMA, Mistral, and Falcon, and discusses techniques for adapting models to specific tasks, such as fine-tuning and prompt engineering, alongside ethical considerations including data bias and model misuse. It compares proprietary and open-source models, emphasising the cost-effectiveness, transparency, and flexibility of open-source models and the implications of this divide for AI research. A detailed analysis is given of MM-LLMs such as BLIP-2, CLIP, LLaVA, Kosmos-1, MiniGPT-4, and mPLUG-Owl, and of their capabilities in multi-modal tasks such as image captioning, text-to-image generation, and visual question answering. The paper also examines the challenges and limitations of these models, including hallucination and computational cost, and outlines their potential applications across a range of domains. The review concludes with a discussion of future directions for MM-LLMs and their impact on AI research and development.
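
The attention mechanism the review highlights as central to transformer architectures like BERT and GPT can be summarised in a few lines. Below is a minimal NumPy sketch of scaled dot-product attention; the function name and the toy dimensions are illustrative choices, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, the core operation in transformer blocks.

    Q, K: (seq_len, d_k) query and key matrices; V: (seq_len, d_v) values.
    Returns the attended values and the attention weight matrix.
    """
    d_k = Q.shape[-1]
    # Similarity of each query with every key, scaled by sqrt(d_k) to keep
    # softmax gradients well-behaved as the dimension grows.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted mixture of the value vectors.
    return weights @ V, weights

# Toy self-attention over 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```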
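To make the multi-modal side concrete, here is a short sketch of zero-shot image-text matching with CLIP, one of the models the review analyses. It assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the example image URL (a standard sample from the transformers documentation) and the candidate captions are chosen for illustration and do not come from the paper.

```python
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

# Load the pretrained CLIP vision-language model and its preprocessor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = ["a photo of two cats", "a photo of a dog", "a diagram of a transformer"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

The same contrastive image-text embedding underlies several of the MM-LLMs surveyed (for example, BLIP-2 and LLaVA build on CLIP-style vision encoders), which is why zero-shot matching is a useful minimal illustration of the multi-modal capabilities discussed above.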