MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

2 Oct 2023 | Deyao Zhu*, Jun Chen*, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
This paper introduces MiniGPT-4, a vision-language model that aligns a frozen visual encoder with a frozen advanced large language model (LLM), Vicuna, using a single projection layer. The model demonstrates advanced multi-modal abilities similar to those of GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. Additionally, MiniGPT-4 exhibits other emerging capabilities, including writing stories and poems inspired by images, teaching users how to cook based on food photos, and retrieving rich facts about people, movies, or art directly from images. The model is trained on a large collection of aligned image-text pairs and then fine-tuned with a smaller but detailed image description dataset to improve generation reliability and usability. The results show that MiniGPT-4 outperforms BLIP-2 in generating captions that are more closely aligned with the ground-truth visual objects and relationships. The second-stage fine-tuning significantly improves the quality of generated outputs. However, MiniGPT-4 still faces challenges such as hallucination and spatial information understanding. The study highlights the importance of aligning visual features with an advanced language model to enhance vision-language models.
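To make the alignment idea concrete, the sketch below shows how a single trainable linear projection can map features from a frozen visual encoder into the embedding space of a frozen LLM. This is a minimal illustration of the concept, not the authors' released code: the module names, feature dimensions, and the plain `nn.Linear` signature are assumptions for the example, and the actual MiniGPT-4 implementation (including its use of a BLIP-2-style visual backbone and Vicuna) should be consulted for exact details.

```python
import torch
import torch.nn as nn


class VisualToLLMProjection(nn.Module):
    """Illustrative sketch of MiniGPT-4's alignment strategy: freeze both the
    visual encoder and the LLM, and train only one linear projection that maps
    visual tokens into the LLM's input embedding space.

    `visual_dim` and `llm_dim` are placeholder values, not the paper's exact
    dimensions."""

    def __init__(self, visual_encoder: nn.Module, llm: nn.Module,
                 visual_dim: int = 768, llm_dim: int = 5120):
        super().__init__()
        self.visual_encoder = visual_encoder  # frozen, e.g. a pretrained ViT-style backbone
        self.llm = llm                        # frozen, e.g. a Vicuna-style decoder
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # The only trainable component in this sketch: one projection layer.
        self.proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Extract visual features without tracking gradients (encoder is frozen).
        with torch.no_grad():
            visual_tokens = self.visual_encoder(images)   # (batch, num_tokens, visual_dim)
        # Project into the LLM's embedding space; in the full model these
        # projected tokens are prepended to the text prompt embeddings.
        return self.proj(visual_tokens)                    # (batch, num_tokens, llm_dim)
```

In this setup, both training stages described in the summary (pretraining on large image-text collections, then fine-tuning on a small set of detailed descriptions) would update only the projection layer's parameters, which is what keeps the approach lightweight.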