AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception

24 Jul 2024 | Yipo Huang, Xiangfei Sheng, Zhichao Yang, Quan Yuan, Zhichao Duan, Pengfei Chen, Leida Li, Weisi Lin, Guangming Shi
This paper introduces AesExpert, a multi-modality foundation model for image aesthetics perception that significantly outperforms state-of-the-art multimodal large language models on aesthetic perception tasks. The model is built on AesMMIT, a comprehensive annotated dataset of 21,904 images and 88K human feedback annotations collected through progressive questions designed to capture diverse aesthetic perceptions. The dataset is further refined with GPT-4 to produce 409K instruction-following samples covering multiple aesthetic perception dimensions. Using this corpus, the authors fine-tune open-sourced general-purpose foundation models to obtain AesExpert, which outperforms GPT-4V and Gemini-Pro-Vision on aesthetic perception tasks. Evaluated on the AesBench benchmark, the model shows significant improvements in aesthetic perception, empathy, assessment, and interpretation abilities, and the AesMMIT dataset is shown to be effective at improving the aesthetic capabilities of multi-modality foundation models. The authors release the dataset and model to the community for further research and development.
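To make the instruction-tuning step concrete, the sketch below shows how a single human aesthetic feedback entry might be packaged as an instruction-following sample for fine-tuning an open-sourced multimodal foundation model. The conversation layout follows the commonly used LLaVA-style format; the field names, the `build_aes_sample` helper, and the example feedback text are illustrative assumptions, not the actual AesMMIT schema.

```python
import json


def build_aes_sample(image_path: str, question: str, human_feedback: str) -> dict:
    """Assemble one hypothetical instruction-following sample.

    The structure mirrors the widely used LLaVA-style conversation format
    (alternating human/assistant turns with an <image> placeholder token);
    the real AesMMIT schema may differ.
    """
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": human_feedback},
        ],
    }


if __name__ == "__main__":
    # Hypothetical example: a progressive question probing aesthetic assessment,
    # paired with a free-form human feedback response.
    sample = build_aes_sample(
        image_path="images/000123.jpg",
        question="How is the overall aesthetic quality of this image, and why?",
        human_feedback=(
            "The photo has pleasing warm tones and a clear subject, but the "
            "horizon is slightly tilted, so the overall aesthetic quality is "
            "good rather than excellent."
        ),
    )
    print(json.dumps(sample, indent=2))
```

A collection of such records, one per image-question pair, could then be fed to a standard visual instruction-tuning pipeline to reproduce the fine-tuning setup described above.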