11 Jul 2024 | Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang
This paper conducts a comprehensive and systematic study of Multimodal Large Language Models (MLLMs) for Image Quality Assessment (IQA). The authors investigate nine prompting systems, obtained by crossing three standardized psychophysical testing procedures (single-stimulus, double-stimulus, and multiple-stimulus methods) with three popular NLP prompting strategies (standard, in-context, and chain-of-thought prompting). They also propose a computational procedure for selecting difficult samples, using top-performing IQA expert models as proxies and accounting for sample diversity and uncertainty. The study evaluates three open-source MLLMs and one closed-source MLLM on several visual attributes of image quality, including structural and textural distortions, geometric transformations, and color differences, in both full-reference (FR) and no-reference (NR) scenarios.
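To make the design space concrete, the sketch below assembles prompts by crossing the three testing procedures with the three prompting strategies. The dictionary keys mirror the paper's terminology, but every prompt wording and the build_prompt helper are illustrative assumptions, not the authors' exact templates.

```python
# Hypothetical sketch of the 3 x 3 = 9 prompting systems: three psychophysical
# testing procedures crossed with three NLP prompting strategies.
# All prompt text here is invented for illustration.

TESTING_PROCEDURES = {
    "single-stimulus": "Rate the quality of the following image.",
    "double-stimulus": "Compare the quality of Image A and Image B.",
    "multiple-stimulus": "Rank the following images from best to worst quality.",
}

PROMPTING_STRATEGIES = {
    "standard": "",
    "in-context": "Here are a few rated examples to calibrate your judgment:\n{examples}\n",
    "chain-of-thought": "First describe the visible distortions step by step, then give your final answer.\n",
}

def build_prompt(procedure: str, strategy: str, examples: str = "") -> str:
    """Compose one of the nine prompting systems from its two ingredients."""
    prefix = PROMPTING_STRATEGIES[strategy].format(examples=examples)
    return prefix + TESTING_PROCEDURES[procedure]

if __name__ == "__main__":
    # Enumerate all nine combinations once, as a sanity check.
    for proc in TESTING_PROCEDURES:
        for strat in PROMPTING_STRATEGIES:
            print(f"[{proc} + {strat}]")
            print(build_prompt(proc, strat, examples="(few-shot examples here)"))
            print()
```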
The results show that different MLLMs require different prompting systems to perform optimally. Only the closed-source GPT-4V provides a reasonable account of human perception of image quality, yet it still struggles with fine-grained quality variations and multiple-image quality analysis. The study highlights the need to re-evaluate recent progress in MLLMs for IQA and suggests that fine-tuning open-source MLLMs on existing datasets may be ineffective due to the risk of catastrophic forgetting. The findings also emphasize the importance of sample selection when evaluating MLLMs for IQA, motivating the proposed computational procedure for efficiently identifying informative testing samples.
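The summary does not spell out the selection procedure, but a minimal sketch of the idea might look like the following: treat disagreement among top-performing IQA expert models as an uncertainty (difficulty) signal, and use a greedy max-min rule over content features for diversity. The function name, the product-based scoring, and the use of standard deviation for disagreement are all assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def select_informative_samples(expert_scores: np.ndarray,
                               features: np.ndarray,
                               k: int) -> list[int]:
    """Greedily pick k testing samples that balance uncertainty and diversity.

    expert_scores: (n_samples, n_experts) quality predictions from top-performing
                   IQA models used as proxies for human judgments.
    features:      (n_samples, d) content descriptors used to measure diversity.
    Hypothetical instantiation: uncertainty is the disagreement (standard deviation)
    among expert models; diversity is the distance to already-selected samples.
    """
    uncertainty = expert_scores.std(axis=1)      # proxy for how difficult/ambiguous a sample is
    selected = [int(uncertainty.argmax())]       # start from the most ambiguous sample
    while len(selected) < k:
        # Distance of every candidate to its nearest already-selected sample.
        dists = np.linalg.norm(
            features[:, None, :] - features[selected][None, :, :], axis=-1
        )
        diversity = dists.min(axis=1)
        score = uncertainty * diversity          # trade off difficulty against coverage
        score[selected] = -np.inf                # never re-pick a sample
        selected.append(int(score.argmax()))
    return selected
```

Under these assumptions, the diversity term keeps the selected set from collapsing onto a single distortion type, while the uncertainty term concentrates the evaluation budget on samples where the expert proxies disagree most.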