A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment


11 Jul 2024 | Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang
This paper presents a comprehensive study of using Multimodal Large Language Models (MLLMs) for Image Quality Assessment (IQA). The authors investigate nine prompting systems for MLLMs, combining three standardized psychophysical testing procedures (single-stimulus, double-stimulus, and multiple-stimulus methods) with three popular prompting strategies from natural language processing (standard, in-context, and chain-of-thought prompting). They also propose a difficult sample selection procedure to challenge MLLMs with diverse and uncertain samples. The study evaluates three open-source MLLMs and one closed-source MLLM on several visual attributes of image quality in both full-reference (FR) and no-reference (NR) scenarios. The results show that only the closed-source GPT-4V provides a reasonable account of human perception of image quality, yet it remains weak at discriminating fine-grained quality variations and at comparing the visual quality of multiple images. The study also highlights that different MLLMs require different prompting systems to perform optimally.
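As a rough illustration of how the nine prompting systems arise, the sketch below simply crosses the three psychophysical testing procedures with the three prompting strategies. The labels are placeholders for the paper's prompt templates, whose exact wording is not reproduced in this summary.

```python
from itertools import product

# Placeholder labels; the paper's actual prompt templates are more elaborate.
test_procedures = ["single-stimulus", "double-stimulus", "multiple-stimulus"]
prompt_strategies = ["standard", "in-context", "chain-of-thought"]

# The 3 x 3 grid yields the nine prompting systems studied in the paper.
prompting_systems = list(product(test_procedures, prompt_strategies))

for procedure, strategy in prompting_systems:
    print(f"{procedure} testing + {strategy} prompting")
```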
The authors argue that directly fine-tuning open-source MLLMs on datasets with image quality descriptions may not be effective due to the risk of catastrophic forgetting. The paper also discusses the limitations of the current prompting systems and opportunities for future work, including automatic prompt optimization, extending the sampler to large-scale unlabeled image sets, and exploring instruction tuning to enhance IQA performance. Because MLLM inference is costly, the study emphasizes the importance of sample selection when evaluating MLLMs for IQA, and the authors propose a computational procedure for difficult sample selection that accounts for both sample diversity and uncertainty. The results show that chain-of-thought prompting consistently improves the performance of GPT-4V under all three psychophysical testing protocols and across nearly all visual attributes. The study concludes that there is still ample room for improving the IQA capabilities of MLLMs, especially in fine-grained quality discrimination and multiple-image quality analysis.
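The paper's exact sampler is not reproduced in this summary; the sketch below is only a minimal illustration of the general idea, assuming each candidate image has a feature embedding (used for diversity) and an uncertainty score (e.g., disagreement among existing IQA models), and greedily picking samples that are both uncertain and far from those already chosen.

```python
import numpy as np

def select_difficult_samples(features, uncertainty, k, alpha=0.5):
    """Greedy selection of k samples trading off uncertainty against diversity.

    features:    (N, D) array of image embeddings (any vision backbone).
    uncertainty: (N,) array, e.g., score disagreement among existing IQA models.
    alpha:       weight between the uncertainty and diversity terms.
    """
    selected = [int(np.argmax(uncertainty))]  # start from the most uncertain sample
    while len(selected) < k:
        # Diversity: distance from each candidate to its nearest already-selected sample.
        dists = np.linalg.norm(
            features[:, None, :] - features[selected][None, :, :], axis=-1
        ).min(axis=1)
        score = alpha * uncertainty + (1 - alpha) * dists
        score[selected] = -np.inf  # never re-pick a chosen sample
        selected.append(int(np.argmax(score)))
    return selected
```

In this hypothetical setup, the selected indices would then be the images actually sent to the MLLMs for evaluation, keeping the number of costly inference calls small while covering diverse and hard cases.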