24 Aug 2024 | Xingyu Fu, Muyu He, YuJie Lu, William Yang Wang, Dan Roth
The Commonsense-T2I challenge evaluates whether text-to-image (T2I) generation models can produce images that align with real-world commonsense. Each task consists of two adversarial prompts that share similar action words but imply different outcomes, such as "a lightbulb without electricity" vs. "a lightbulb with electricity," and models must generate images reflecting the correct outcome for each, e.g., "The lightbulb is unlit" vs. "The lightbulb is lit." The dataset is carefully curated by experts and comprises 150 examples, each annotated with two prompts, their expected outputs, likelihood scores for those outputs, and a fine-grained commonsense-type label.
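A single entry might look roughly like the following sketch; the field names and values are illustrative assumptions based on the description above, not the released dataset schema.

```python
# Hypothetical sketch of one Commonsense-T2I entry.
# Field names are illustrative assumptions, not the official dataset schema.
example = {
    "prompt_1": "A lightbulb without electricity",
    "prompt_2": "A lightbulb with electricity",
    "expected_output_1": "The lightbulb is unlit",
    "expected_output_2": "The lightbulb is lit",
    "commonsense_type": "physical laws",  # fine-grained category label (assumed value)
    "likelihood": 0.9,                    # likelihood of the expected output (assumed scale)
}
```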
The evaluation metric requires both generated images to match their respective expected outputs for a sample to be considered correct. Results show that even the state-of-the-art DALL-E 3 model achieves only 48.92% accuracy, while Stable Diffusion XL achieves 24.92%. The findings indicate that current T2I models lack the ability to reason about commonsense, and GPT-enriched prompts do not solve the challenge. The study proposes an automatic evaluation pipeline using multimodal large language models (LLMs) and shows that it aligns well with human evaluations.
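The pairwise scoring rule can be written down compactly. Below is a minimal sketch, assuming a T2I `generate` function and a `judge` callable (e.g., a multimodal LLM prompted to check whether an image matches its expected description); the names and signatures are assumptions for illustration, not the paper's actual pipeline.

```python
from typing import Any, Callable, Sequence, Tuple


def pairwise_accuracy(
    samples: Sequence[Tuple[str, str, str, str]],
    generate: Callable[[str], Any],     # assumed T2I model: prompt -> image
    judge: Callable[[Any, str], bool],  # assumed matcher: (image, expected description) -> bool
) -> float:
    """Fraction of samples where BOTH generated images match their expected outputs.

    Each sample is (prompt_1, expected_1, prompt_2, expected_2).
    """
    correct = 0
    for prompt_1, expected_1, prompt_2, expected_2 in samples:
        img_1, img_2 = generate(prompt_1), generate(prompt_2)
        # A sample counts as correct only if both images reflect their expected outcomes.
        if judge(img_1, expected_1) and judge(img_2, expected_2):
            correct += 1
    return correct / len(samples) if samples else 0.0
```

Because a sample only counts when both sides of the pair are right, a model that ignores the commonsense-relevant difference between the two prompts, and so depicts the same outcome twice, can match at most one expected output and scores the pair as wrong.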
The results highlight a significant gap between current T2I models and human-level commonsense reasoning. The Commonsense-T2I benchmark aims to provide a high-quality evaluation tool for assessing T2I models' ability to generate images that align with real-world commonsense. The study also includes detailed analyses of error cases and the limitations of current models, emphasizing the need for further research to improve T2I models' commonsense reasoning capabilities.