17 Mar 2025 | Olivia Wiles*,† Chuhan Zhang*,† Isabela Albuquerque*,† Ivana Kajić† Su Wang† Emanuele Bugliarello† Yasumasa Onoe† Pinelopi Papalampidi† Ira Ktena† Chris Knutsen† Cyrus Rashtchian‡ Anant Nawalgaria§ Jordi Pont-Tuset† Aida Nematzadeh†
The paper "Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts and Human Rating" by Olivia Wiles et al. addresses the issue of evaluating text-to-image (T2I) generative models and their alignment with prompts. The authors find that current evaluation methods, which often focus on a single set of prompts or human annotations, do not provide stable and generalizable conclusions. To address this, they introduce the Gecko evaluation suite, which includes over 100K annotations across four human annotation templates and a comprehensive set of 2K prompts. This suite allows for a more systematic and reliable evaluation of T2I models and alignment metrics.
The paper highlights the importance of considering different prompt sets and human annotation templates to ensure that evaluations are not biased by specific data slices. It also introduces a new interpretable auto-eval metric, Gecko, which outperforms existing metrics in terms of correlation with human ratings across various evaluation tasks, including model ordering, pair-wise instance scoring, and point-wise instance scoring.
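The Gecko metric is a question-answering-based auto-eval: broadly, it generates questions that cover the prompt, filters out questions the prompt itself cannot support, and scores the image by a VQA model's confidence in the expected answers. The following is a minimal, hedged sketch of that idea, not the paper's implementation; the callables `generate_questions`, `filter_questions`, and `vqa_answer_prob` are hypothetical placeholders for an LLM question generator, an LLM-based filter, and a VQA model.

```python
# Hedged sketch of a QA-based text-image alignment score (Gecko-style).
# The three helper callables are hypothetical stand-ins, not APIs from
# the paper: an LLM question generator, an LLM-based consistency filter,
# and a VQA model that returns the probability of a given answer.
from typing import Callable, List, Tuple


def qa_alignment_score(
    prompt: str,
    image,  # e.g. a PIL.Image produced by the T2I model under test
    generate_questions: Callable[[str], List[Tuple[str, str]]],
    filter_questions: Callable[[str, List[Tuple[str, str]]], List[Tuple[str, str]]],
    vqa_answer_prob: Callable[[object, str, str], float],
) -> float:
    """Score how well `image` matches `prompt` via question answering.

    1. Generate (question, expected_answer) pairs that cover the prompt.
    2. Filter out questions the prompt itself cannot answer.
    3. Ask a VQA model each question about the image and take the
       probability it assigns to the expected answer.
    4. Average the per-question scores into a single alignment score.
    """
    qa_pairs = generate_questions(prompt)
    qa_pairs = filter_questions(prompt, qa_pairs)
    if not qa_pairs:
        return 0.0
    scores = [vqa_answer_prob(image, q, a) for q, a in qa_pairs]
    return sum(scores) / len(scores)
```

Averaging per-question scores is also what makes this style of metric interpretable: each question's score points to the part of the prompt the image fails to satisfy.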
Key contributions of the paper include:
1. **Gecko Evaluation Suite**: A comprehensive set of prompts and human annotations to evaluate T2I models and alignment metrics.
2. **Interpretable Auto-Eval Metric (Gecko)**: A metric that consistently correlates better with human ratings and is more reliable across different evaluation tasks.
3. **Systematic Evaluation**: The authors demonstrate that different metrics and models perform differently depending on the prompt set or human annotation template, emphasizing the need for a standardized and comprehensive evaluation framework (see the per-slice correlation sketch after this list).
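A concrete way to read contributions 2 and 3 is that a metric's agreement with human ratings should be measured separately on every data slice. A minimal sketch, assuming per-image records that carry a metric score and a human rating, is to group the data by metric, prompt subset, and annotation template and compute a rank correlation within each slice; the record fields below are assumptions for illustration, not the paper's data schema.

```python
# Hedged sketch: comparing auto-eval metrics by their Spearman (rank)
# correlation with human ratings, computed separately for every
# (metric, prompt set, annotation template) slice.
from collections import defaultdict
from scipy.stats import spearmanr


def per_slice_correlation(records):
    """records: iterable of dicts with keys 'metric', 'prompt_set',
    'template', 'metric_score', and 'human_rating' (one dict per image)."""
    grouped = defaultdict(lambda: ([], []))
    for r in records:
        key = (r["metric"], r["prompt_set"], r["template"])
        grouped[key][0].append(r["metric_score"])
        grouped[key][1].append(r["human_rating"])

    correlations = {}
    for key, (scores, ratings) in grouped.items():
        rho, _ = spearmanr(scores, ratings)  # rank correlation for this slice
        correlations[key] = rho
    return correlations
```

Comparing these per-slice correlations is what reveals whether a metric's apparent advantage depends on a particular prompt set or annotation template.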
The paper concludes by highlighting the importance of a standardized evaluation framework and by providing tools for developers and practitioners to better understand and evaluate T2I models.