REVISITING TEXT-TO-IMAGE EVALUATION WITH GECKO: ON METRICS, PROMPTS AND HUMAN RATING


17 Mar 2025 | Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Pinelopi Papalampidi, Ira Ktena, Chris Knutsen, Cyrus Rashtchian, Anant Nawalgaria, Jordi Pont-Tuset, Aida Nematzadeh
The paper introduces Gecko2K, a comprehensive benchmark for evaluating text-to-image (T2I) models and alignment metrics. It addresses the limitations of current evaluation methods with a large-scale dataset of over 100,000 human annotations collected across four annotation templates. The benchmark comprises a curated set of 2,000 prompts and evaluates models on three tasks: model ordering, pairwise instance scoring, and point-wise instance scoring. The authors also propose a new, interpretable auto-eval metric that correlates with human ratings better than existing metrics across templates and settings.

The study highlights the importance of diverse, comprehensive datasets for evaluating T2I models, since results can vary significantly with the prompt set and annotation template. Different metrics perform differently across tasks, so a single metric may not suffice for all scenarios. The choice of evaluation task also affects results, and significant model orderings must be established through statistical analysis, as illustrated by the significance-test sketch below.
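To show what establishing a significant ordering can look like, the sketch below applies a paired Wilcoxon signed-rank test to per-prompt human ratings for two models. This is a standard choice for paired ratings, not necessarily the exact procedure used in the paper; the function name and rating values are illustrative assumptions.

```python
from scipy.stats import wilcoxon

def significant_ordering(ratings_a, ratings_b, alpha=0.05):
    """Decide whether model A or B is significantly better, or call a tie.

    `ratings_a` and `ratings_b` are human alignment ratings for the same
    prompts (paired by index). A Wilcoxon signed-rank test checks whether
    the per-prompt differences are significant at level `alpha`.
    """
    _, p_value = wilcoxon(ratings_a, ratings_b)
    if p_value >= alpha:
        return "tie"  # no significant ordering; report the models as tied
    mean_a = sum(ratings_a) / len(ratings_a)
    mean_b = sum(ratings_b) / len(ratings_b)
    return "A" if mean_a > mean_b else "B"

# Hypothetical per-prompt ratings for two models on the same eight prompts.
model_a = [0.90, 0.80, 0.70, 0.95, 0.60, 0.85, 0.75, 0.90]
model_b = [0.70, 0.60, 0.70, 0.80, 0.50, 0.70, 0.65, 0.80]
print(significant_ordering(model_a, model_b))  # -> "A"
```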
The paper compares various auto-eval metrics, including CLIPScore, VNLI, and VQAScore, and finds that the proposed question-answering-based Gecko metric performs consistently well across tasks (the QA-style sketch below illustrates the recipe). The authors also extend the benchmark to other modalities, such as text-to-video generation, and show that the Gecko metric aligns closely with human judgments in these settings as well.

The study emphasizes the need for standardized evaluation frameworks that account for the prompt sets, annotation templates, and metrics used. It concludes that a comprehensive and diverse evaluation approach is essential for accurately assessing T2I models and alignment metrics, and that results should be interpreted in the context of reliable prompts and the specific evaluation setting.
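To make the QA-based approach concrete, here is a minimal sketch of the recipe behind a Gecko-style metric: generate question-answer pairs from the prompt, filter unreliable ones, and average a VQA model's confidence in the expected answers. The callables `generate_qa`, `keep_qa`, and `answer_prob` stand in for the LLM question generator, the question filter, and the VQA scorer; their names and signatures are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, expected answer), both derived from the prompt

def qa_alignment_score(
    prompt: str,
    image: object,
    generate_qa: Callable[[str], List[QAPair]],       # LLM: prompt -> QA pairs
    keep_qa: Callable[[QAPair, str], bool],           # filter: drop unreliable pairs
    answer_prob: Callable[[object, str, str], float]  # VQA: P(expected answer | image, question)
) -> float:
    """Score text-image alignment as the mean VQA confidence over
    filtered question-answer pairs derived from the prompt."""
    qa_pairs = [qa for qa in generate_qa(prompt) if keep_qa(qa, prompt)]
    if not qa_pairs:
        return 0.0  # no usable questions: treat alignment as unverifiable
    return sum(answer_prob(image, q, a) for q, a in qa_pairs) / len(qa_pairs)
```

Because the score decomposes into per-question confidences, it is interpretable: a low overall score can be traced back to the specific prompt elements the image fails to satisfy.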