21 Jun 2024 | Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, Deva Ramanan
GenAI-Bench evaluates and improves compositional text-to-visual generation. The paper introduces VQAScore, a metric that measures the likelihood that a VQA model judges an image to accurately depict its prompt. VQAScore outperforms existing metrics such as CLIPScore and substantially improves generation by ranking candidate images, without any fine-tuning. The authors conducted an extensive human study using GenAI-Bench, a benchmark of 1,600 real-world text prompts sourced from professional designers, collecting over 80,000 human ratings to evaluate scoring metrics on image generation. VQAScore proves more effective than other metrics at improving the human-rated prompt alignment of DALL-E 3 and Stable Diffusion, especially on compositional prompts. The authors also release GenAI-Rank, a new benchmark with over 40,000 human ratings for evaluating ranking methods. They discuss VQAScore's limitations, such as its inability to detect fine-grained visual details or resolve linguistic ambiguity, but nevertheless recommend it as a reliable alternative to CLIPScore. The paper concludes that VQAScore is a promising tool for evaluating and improving text-to-visual generation.
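The ranking idea described above can be sketched in a few lines: score each candidate image by the VQA model's probability of answering "Yes" to a question built from the prompt, then pick the highest-scoring candidate. This is only a minimal sketch; `vqa_prob_yes` is a hypothetical stand-in for a real multimodal model (the paper uses a VQA model for this step), and the exact question template here is an assumption.

```python
# Hedged sketch of VQAScore-style best-of-N ranking.
# Assumption: `vqa_prob_yes(image, question)` is a hypothetical callable
# returning the VQA model's probability of answering "Yes"; in practice
# this would wrap a real vision-language model.

def vqa_score(image, prompt, vqa_prob_yes):
    """Score = P("Yes" | image, question derived from the prompt)."""
    # Question template is illustrative, not the paper's exact wording.
    question = f'Does this figure show "{prompt}"? Please answer yes or no.'
    return vqa_prob_yes(image, question)

def rank_candidates(images, prompt, vqa_prob_yes):
    """Sort candidate images best-first by VQAScore; no fine-tuning involved."""
    return sorted(
        images,
        key=lambda image: vqa_score(image, prompt, vqa_prob_yes),
        reverse=True,
    )
```

Because ranking only requires calling the scorer on each candidate, it can sit on top of any generator (e.g. DALL-E 3 or Stable Diffusion) as a post-hoc selection step.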