AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception


24 Jul 2024 | Yipo Huang, Xiangfei Sheng, Zhichao Yang, Quan Yuan, Zhichao Duan, Pengfei Chen, Leida Li, Weisi Lin, Guangming Shi
The paper "AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception" addresses the challenge of image aesthetics perception (IAP) in multimodal large language models (MLLMs). The authors introduce a comprehensive annotated dataset called Aesthetic Multi-Modality Instruction Tuning (AesMMIT), which includes 21,904 diverse-sourced images and 88K human natural language feedbacks. This dataset is designed to align MLLMs with human aesthetics perception by capturing a wide range of aesthetic dimensions, from coarse-grained evaluations to fine-grained descriptions. To enhance the model's ability to handle diverse queries, the authors use GPT-4 to refine the aesthetic critiques and assemble the AesMMIT dataset, which consists of 409K multi-typed instructions. Based on the AesMMIT dataset, the authors fine-tune open-sourced general foundation models, resulting in multi-modality Aesthetic Expert models named AesExpert. Extensive experiments demonstrate that AesExpert models outperform state-of-the-art MLLMs, including GPT-4V and Gemini-Pro-Vision, in various aesthetic perception tasks. The paper also includes a detailed dataset construction process, model architecture, and performance comparisons with existing models. The authors conclude that their approach significantly improves the aesthetic perception abilities of MLLMs and encourages further research in this area.The paper "AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception" addresses the challenge of image aesthetics perception (IAP) in multimodal large language models (MLLMs). The authors introduce a comprehensive annotated dataset called Aesthetic Multi-Modality Instruction Tuning (AesMMIT), which includes 21,904 diverse-sourced images and 88K human natural language feedbacks. This dataset is designed to align MLLMs with human aesthetics perception by capturing a wide range of aesthetic dimensions, from coarse-grained evaluations to fine-grained descriptions. To enhance the model's ability to handle diverse queries, the authors use GPT-4 to refine the aesthetic critiques and assemble the AesMMIT dataset, which consists of 409K multi-typed instructions. Based on the AesMMIT dataset, the authors fine-tune open-sourced general foundation models, resulting in multi-modality Aesthetic Expert models named AesExpert. Extensive experiments demonstrate that AesExpert models outperform state-of-the-art MLLMs, including GPT-4V and Gemini-Pro-Vision, in various aesthetic perception tasks. The paper also includes a detailed dataset construction process, model architecture, and performance comparisons with existing models. The authors conclude that their approach significantly improves the aesthetic perception abilities of MLLMs and encourages further research in this area.