21 Jun 2024 | Baiqi Li¹*, Zhiqiu Lin¹,²*, Deepak Pathak¹, Jiayao Li¹, Yixin Fei¹, Kewen Wu¹, Tiffany Ling¹, Xide Xia²†, Pengchuan Zhang²†, Graham Neubig¹†, Deva Ramanan¹†
GenAI-Bench is a comprehensive benchmark for evaluating and improving text-to-visual models on compositional text prompts, which involve attributes, relationships, and higher-order reasoning such as counting, comparison, negation, and logic. The benchmark contains 1,600 challenging real-world prompts sourced from professional designers, covering both basic and advanced compositional skills. Human annotators rate each generated visual on a Likert scale for how well it matches its input prompt. The study finds that while state-of-the-art models such as DALL-E 3 and Stable Diffusion handle basic prompts well, they struggle with prompts that require advanced reasoning.
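To make the evaluation protocol concrete, here is a minimal sketch of how per-skill human ratings could be aggregated. The record fields (`prompt`, `skills`, `rating`) and the example data are illustrative assumptions, not the released GenAI-Bench schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical GenAI-Bench-style records: each generated image gets a 1-5
# Likert rating for how well it matches its prompt, and each prompt is
# tagged with the compositional skills it exercises. Field names are
# illustrative, not the released schema.
ratings = [
    {"prompt": "a red cube on a blue sphere", "skills": ["attribute", "relation"], "rating": 4},
    {"prompt": "three dogs but no cats", "skills": ["counting", "negation"], "rating": 2},
    {"prompt": "the taller of two towers glows", "skills": ["comparison"], "rating": 3},
]

# Average human rating per skill: basic skills (attributes, relations) tend
# to score higher than advanced ones (counting, negation, comparison).
by_skill = defaultdict(list)
for record in ratings:
    for skill in record["skills"]:
        by_skill[skill].append(record["rating"])

for skill, scores in sorted(by_skill.items()):
    print(f"{skill:>10}: {mean(scores):.2f} (n={len(scores)})")
```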
The paper introduces VQAScore, an automated evaluation metric defined as the probability that a visual-question-answering (VQA) model answers "Yes" when asked whether the image shows the content of the text prompt. VQAScore correlates more strongly with human judgments than prior metrics such as CLIPScore, PickScore, and Davidsonian Scene Graph (DSG). It is computed end-to-end from off-the-shelf VQA models without fine-tuning on human feedback, and can also be applied to proprietary models in a black-box fashion. The study further shows that ranking candidate images by VQAScore substantially improves the alignment of the selected images with their prompts, outperforming other scoring methods by 2x to 3x.
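A minimal sketch of the VQAScore idea: score an image by the probability that a generative VQA model answers "Yes" to a question asking whether the image shows the prompt. The paper's reference model is a CLIP-FlanT5 VQA model; the sketch below substitutes an off-the-shelf BLIP-2 FlanT5 checkpoint from Hugging Face as a stand-in, and the exact question template and token handling are assumptions.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Stand-in VQA model; the paper's reference model is CLIP-FlanT5, not BLIP-2.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=dtype
).to(device)

def vqascore(image: Image.Image, prompt: str) -> float:
    """P('Yes' | image, question) for a yes/no question built from the prompt."""
    question = f'Does this figure show "{prompt}"? Please answer yes or no.'
    inputs = processor(images=image, text=question, return_tensors="pt").to(device, dtype)
    # Token id for "Yes" under the FlanT5 tokenizer (assumed to be a single piece).
    yes_id = processor.tokenizer("Yes", add_special_tokens=False).input_ids[0]
    # Score only the first decoder step, i.e. the probability the answer starts with "Yes".
    start_id = model.config.text_config.decoder_start_token_id
    decoder_input_ids = torch.tensor([[start_id]], device=device)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits
    return logits[0, -1].float().softmax(dim=-1)[yes_id].item()

score = vqascore(Image.open("generated.png"), "three dogs but no cats")
print(f"VQAScore: {score:.3f}")
```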
The paper also discusses limitations of VQAScore, particularly in handling fine-grained visual details and linguistic ambiguity, and suggests directions for improvement. A new benchmark, GenAI-Rank, is proposed to evaluate methods that rank multiple images generated from the same prompt, with over 40,000 human ratings to be released for reproducibility. The paper concludes by emphasizing the importance of automated evaluation metrics and the need for further research into Goodhart's Law, where over-optimizing a metric causes it to lose its effectiveness.
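As a companion to the ranking experiments and GenAI-Rank, here is a minimal best-of-n sketch: generate several candidates for one prompt, score each with a text-image alignment function (for example the `vqascore` function from the sketch above), and keep the highest-scoring one. The candidate file names and the `score_fn` interface are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable
from PIL import Image

def best_of_n(candidates: list[Path], prompt: str,
              score_fn: Callable[[Image.Image, str], float]) -> tuple[Path, float]:
    """Rank candidate images generated from the same prompt and return the best one."""
    scored = [(path, score_fn(Image.open(path), prompt)) for path in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[0]

# Illustrative usage: nine candidates for one prompt, scored with vqascore()
# from the previous sketch (any image-text alignment score works here).
prompt = "three dogs but no cats"
candidates = [Path(f"candidate_{i}.png") for i in range(9)]
best_path, best_score = best_of_n(candidates, prompt, vqascore)
print(f"best: {best_path} (score {best_score:.3f})")
```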