A-Bench: Are LMMs Masters at Evaluating AI-generated Images?


5 Jun 2024 | Zicheng Zhang, Haoning Wu, Chunyi Li, Yingjie Zhou, Wei Sun, Xiongkuo Min, Zijian Chen, Xiaohong Liu, Weisi Lin, Guangtao Zhai
This paper introduces A-Bench, a benchmark designed to assess whether large multi-modal models (LMMs) can accurately evaluate AI-generated images (AIGIs). The benchmark is structured around two key principles:

1) Emphasizing both high-level semantic understanding and low-level visual quality perception to address the complex demands of AIGIs.
2) Utilizing a variety of generative models for AIGI creation and a variety of LMMs for evaluation, ensuring a comprehensive validation scope.

A-Bench includes 2,864 AIGIs produced by 16 text-to-image models, each paired with question-answer sets annotated by human experts, and is tested across 18 leading LMMs. The results show that LMMs significantly lag behind even the poorest human performance, indicating that they are not yet robust across different AIGI evaluation scenarios: they struggle with nuanced semantic understanding as well as quality perception. By providing a detailed diagnostic framework along these two axes, A-Bench highlights the need for further improvements in LMMs' semantic understanding and quality assessment capabilities, and underscores the importance of developing more accurate and reliable evaluation methods for AIGIs.
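To make the evaluation protocol concrete, below is a minimal sketch of how an LMM could be scored on A-Bench-style expert-annotated question-answer pairs. The record fields (image_path, question, choices, answer) and the ask_lmm callable are illustrative assumptions for this sketch, not the paper's actual data schema or API.

```python
# Sketch: accuracy of an LMM on multiple-choice question-answer pairs
# attached to AI-generated images (A-Bench-style evaluation).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ABenchItem:
    image_path: str      # path to the AI-generated image
    question: str        # expert-annotated question about the image
    choices: List[str]   # candidate answers (multiple choice)
    answer: str          # ground-truth choice annotated by human experts


def evaluate(items: List[ABenchItem],
             ask_lmm: Callable[[str, str, List[str]], str]) -> float:
    """Return the fraction of questions the LMM answers correctly."""
    if not items:
        return 0.0
    correct = 0
    for item in items:
        # ask_lmm is a user-supplied wrapper around whichever LMM is tested;
        # it receives the image, the question, and the choices, and returns
        # the model's selected answer as a string.
        prediction = ask_lmm(item.image_path, item.question, item.choices)
        if prediction.strip().lower() == item.answer.strip().lower():
            correct += 1
    return correct / len(items)


# Usage (hypothetical): accuracy = evaluate(abench_items, my_lmm_wrapper)
# Running the same loop separately on semantic-understanding questions and
# quality-perception questions yields the kind of per-category breakdown
# the benchmark reports and allows comparison against human accuracy.
```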