12 Aug 2024 | Xingyu Fu†‡, Muyu He†‡*, Yujie Lu§*, William Yang Wang§, Dan Roth†
The paper introduces a novel task, Commonsense-T2I, designed to evaluate whether text-to-image (T2I) models produce images that align with real-life commonsense. The task tests whether T2I models can perform visual commonsense reasoning: given two text prompts that differ only slightly, a model should generate images matching the corresponding expected outputs, such as "The lightbulb is unlit" versus "The lightbulb is lit." The dataset consists of 150 expert-curated examples, each containing two adversarial text prompts, their expected output descriptions, a likelihood score, and a commonsense category. The evaluation metric assesses the alignment between the generated images and the expected outputs, with a focus on the ability to reason across modalities.
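To make the setup concrete, below is a minimal sketch of how one paired example and a pairwise scoring rule could be represented in code. The CommonsenseExample dataclass and its field names are illustrative assumptions rather than the released dataset's schema, and the "both images must match" rule is one natural reading of the pairwise metric, not necessarily the paper's exact definition.

```python
from dataclasses import dataclass


@dataclass
class CommonsenseExample:
    """One hypothetical Commonsense-T2I item: two adversarial prompts that
    differ slightly but imply different visual outcomes. Field names are
    illustrative, not the released dataset's actual keys."""
    prompt_1: str      # first text prompt
    expected_1: str    # expected output, e.g. "The lightbulb is unlit"
    prompt_2: str      # adversarial counterpart with a minor wording change
    expected_2: str    # expected output, e.g. "The lightbulb is lit"
    likelihood: float  # annotated likelihood score for the scenario
    category: str      # commonsense category label


def score_pair(match_1: bool, match_2: bool) -> float:
    """Pairwise scoring sketch: credit an example only when the image for
    each prompt matches its own expected output description."""
    return 1.0 if (match_1 and match_2) else 0.0
```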
The authors benchmark a variety of state-of-the-art T2I models, including DALL-E 3, Stable Diffusion models, Playground v2.5, Openjourney v4, and Flux models. Surprisingly, even the best model, DALL-E 3, achieves only 48.92% accuracy on Commonsense-T2I, indicating a significant gap between current models and human-level intelligence. The paper also explores GPT-enriched prompts and finds that they do not solve the challenge. Detailed analyses probe the sources of this deficiency, including the role of text embeddings and the limitations of multimodal large language models (LLMs) as evaluators of T2I models.
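One plausible reading of "GPT-enriched prompts" is asking an LLM to expand a short prompt with the real-life details it implies before passing it to the T2I model. A rough sketch using the OpenAI chat API is shown below; the model name, the instruction text, and the enrich_prompt helper are assumptions for illustration, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def enrich_prompt(prompt: str, model: str = "gpt-4o") -> str:
    """Ask an LLM to rewrite a short T2I prompt with the visual details that
    real-world commonsense implies (hypothetical helper for illustration)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Expand the user's image prompt with the visual "
                        "details that real-world commonsense implies. "
                        "Return only the rewritten prompt."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content


# The enriched prompt would then be sent to the T2I model in place of the
# original one; per the paper, this alone does not solve Commonsense-T2I.
```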
The main contributions of the paper are threefold: (1) proposing a high-quality, expert-annotated benchmark for evaluating commonsense reasoning in T2I models, (2) introducing an automatic evaluation pipeline using multimodal LLMs, and (3) benchmarking a wide range of T2I models on Commonsense-T2I, highlighting how far current models remain from human-level commonsense. The authors aim to foster further research and advancements in real-life image generation by exposing these gaps in commonsense reasoning.
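As a rough illustration of how such an automatic, multimodal-LLM-based evaluation pipeline could be wired together, the sketch below treats the T2I model and the LLM judge as two abstract callables, generate_image and llm_judge; both names, and the both-images-must-match accuracy, are assumptions for illustration rather than the paper's exact implementation.

```python
from typing import Callable, Iterable

# Hypothetical interfaces: a T2I model that maps a prompt to image bytes, and
# a multimodal-LLM judge that decides whether an image fits a description.
GenerateFn = Callable[[str], bytes]
JudgeFn = Callable[[bytes, str], bool]


def evaluate_benchmark(examples: Iterable,  # items like the earlier CommonsenseExample sketch
                       generate_image: GenerateFn,
                       llm_judge: JudgeFn) -> float:
    """Generate one image per prompt and report pairwise accuracy: an example
    counts only if both images match their expected output descriptions."""
    examples = list(examples)
    correct = 0
    for ex in examples:
        img_1 = generate_image(ex.prompt_1)
        img_2 = generate_image(ex.prompt_2)
        if llm_judge(img_1, ex.expected_1) and llm_judge(img_2, ex.expected_2):
            correct += 1
    return correct / len(examples)
```

Abstracting the generator and judge this way keeps the same harness reusable across all benchmarked T2I models and across different multimodal-LLM judges.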