This paper explores the image quality assessment (IQA) capabilities of large multimodal models (LMMs) using the two-alternative forced choice (2AFC) method, widely regarded as the most reliable way to collect human opinions on visual quality. The authors introduce three evaluation criteria (consistency, accuracy, and correlation) to comprehensively assess the IQA performance of five LMMs. Extensive experiments on existing image quality datasets reveal that while LMMs generally struggle with IQA tasks, particularly fine-grained quality discrimination, the proprietary model GPT-4V performs notably well. The proposed dataset and methods should facilitate future research on developing more advanced LMMs for IQA. The paper also details its methodology, including coarse-to-fine pairing rules, maximum a posteriori estimation, and the evaluation criteria, along with experimental setups and results.
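In 2AFC evaluation, each trial yields only a pairwise preference, so the per-pair choices must be aggregated into global quality scores before correlation with human opinion can be measured; the paper uses maximum a posteriori estimation for this step. As a rough illustration only (not the paper's exact formulation), the sketch below fits a simple Bradley-Terry model with a small pseudo-count prior to a hypothetical win-count matrix; the matrix values and the `bradley_terry` helper are invented for this example:

```python
import numpy as np

def bradley_terry(wins, n_iter=200, prior=0.1):
    """Estimate latent quality scores from a pairwise win-count matrix.

    wins[i, j] = number of times image i was preferred over image j.
    A small symmetric pseudo-count acts as a crude prior, giving the
    estimate a MAP-like regularization so isolated items stay finite.
    """
    w = wins.astype(float) + prior
    np.fill_diagonal(w, 0.0)          # no self-comparisons
    n = w + w.T                       # total comparisons per pair
    p = np.ones(len(w))               # initial scores
    for _ in range(n_iter):
        # Minorize-maximize update for Bradley-Terry (Hunter, 2004)
        p = w.sum(axis=1) / (n / (p[:, None] + p[None, :])).sum(axis=1)
        p /= p.sum()                  # fix the arbitrary scale
    return p

# Toy 2AFC outcome: 3 images, image 0 clearly preferred overall.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])
scores = bradley_terry(wins)
ranking = np.argsort(-scores)         # best image first
```

The resulting `scores` induce a global ranking that can then be compared against human mean opinion scores, e.g. with a rank correlation coefficient such as SRCC.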