2024-06-18 | Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan
This paper introduces VQAScore, a metric for evaluating text-to-visual generation models. VQAScore uses a visual-question-answering (VQA) model to compute an alignment score as the probability of a "Yes" answer to the simple question "Does this figure show '{text}'?". Despite its simplicity, the method outperforms prior art, including CLIPScore, metrics trained with extensive human feedback, and divide-and-conquer methods, and it achieves state-of-the-art results across multiple image-text alignment benchmarks even when implemented with off-the-shelf VQA models. The authors additionally propose a new model, CLIP-FlanT5; VQAScore with CLIP-FlanT5 outperforms even the strongest baselines that rely on the proprietary GPT-4V. The paper also introduces GenAI-Bench, a more challenging benchmark of 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and higher-order reasoning, together with over 15,000 human ratings of leading image and video generation models. VQAScore further extends to evaluating text-to-video and 3D models, where it significantly surpasses popular metrics such as CLIPScore and PickScore. VQAScore is open-sourced, and the authors provide code and models for evaluation. The paper concludes that VQAScore is a strong alternative to CLIPScore for evaluating text-to-visual generation models, especially on real-world compositional text prompts.
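To make the scoring rule concrete, below is a minimal sketch of the VQAScore recipe using an off-the-shelf generative VQA model from Hugging Face. The choice of Salesforce/blip-vqa-base and the exp(-loss) likelihood readout are assumptions for illustration, not the paper's CLIP-FlanT5 setup; the authors' official implementation is released in their open-sourced t2v_metrics package.

```python
# Minimal VQAScore-style sketch. Assumption: Salesforce/blip-vqa-base
# stands in for the paper's CLIP-FlanT5 model.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()

def vqascore(image: Image.Image, text: str) -> float:
    """Alignment score following the VQAScore recipe: ask
    "Does this figure show '{text}'?" and read out the generative
    likelihood of the answer "yes"."""
    question = f"Does this figure show '{text}'? Please answer yes or no."
    inputs = processor(images=image, text=question, return_tensors="pt")
    # Tokenize the target answer; with `labels` supplied, the model
    # returns the mean token-level negative log-likelihood as `loss`.
    labels = processor(text="yes", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=labels)
    # exp(-loss) is a per-token-averaged likelihood; since the answer is
    # a single word, it serves as a close proxy for P("yes" | image, Q).
    return torch.exp(-out.loss).item()

score = vqascore(Image.open("generated.png"),
                 "a red cube on top of a blue sphere")
print(f"VQAScore: {score:.3f}")
```

Swapping in a stronger VQA backbone (e.g., the paper's CLIP-FlanT5 via t2v_metrics) changes only the model loading; the scoring rule, probability of "Yes" given the image and the templated question, stays the same.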