This paper investigates the image quality assessment (IQA) capabilities of large multimodal models (LMMs) using two-alternative forced choice (2AFC) prompting, widely regarded as the most reliable protocol for collecting human opinions of visual quality. Five LMMs are evaluated, IDEFICS-Instruct, mPLUG-Owl, XComposer-VL, Q-Instruct, and GPT-4V, under three criteria: consistency, accuracy, and correlation. To support the evaluation, the study introduces a new dataset and aggregates the pairwise preferences into global quality rankings via maximum a posteriori (MAP) estimation, which is shown to outperform alternative global ranking aggregation methods. The results show that existing LMMs handle coarse-grained quality comparison reasonably well but struggle with fine-grained quality discrimination; the proprietary GPT-4V outperforms the open-source models on both coarse- and fine-grained IQA tasks, though substantial room for improvement remains. The findings suggest that further research is needed to strengthen the IQA capabilities of LMMs, particularly in fine-grained quality discrimination.
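As a concrete illustration of aggregating pairwise preferences into a global ranking via MAP estimation, here is a minimal sketch using a Bradley-Terry-style likelihood with a Gaussian prior on latent quality scores. The paper's exact probabilistic model is not reproduced here; the function name `map_rank` and all of its parameters are illustrative assumptions.

```python
import numpy as np

def map_rank(wins, prior_var=1.0, lr=0.1, steps=2000):
    """MAP estimate of latent quality scores from pairwise preferences.

    wins[i][j] = number of times item i was preferred over item j.
    Uses a Bradley-Terry likelihood with a zero-mean Gaussian prior
    (variance prior_var) on the scores, optimized by gradient ascent.
    This is a simplified stand-in for the paper's MAP aggregation.
    """
    wins = np.asarray(wins, dtype=float)
    s = np.zeros(wins.shape[0])          # latent quality scores
    total = wins + wins.T                # comparisons per item pair
    for _ in range(steps):
        # P(i beats j) under Bradley-Terry: sigmoid(s_i - s_j)
        p = 1.0 / (1.0 + np.exp(s[None, :] - s[:, None]))
        # Gradient of log-posterior: (observed - expected wins) - prior term
        grad = (wins - total * p).sum(axis=1) - s / prior_var
        s += lr * grad
        s -= s.mean()                    # remove translation invariance
    return s

# Toy example: item 0 is preferred most often, item 2 least often,
# so the recovered scores should rank them 0 > 1 > 2.
scores = map_rank([[0, 8, 9],
                   [2, 0, 7],
                   [1, 3, 0]])
```

A full implementation would also handle ties, confidence intervals, and the scale linkage needed when pairs come from different image subsets, which this sketch omits.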