Learning Multi-dimensional Human Preference for Text-to-Image Generation


23 May 2024 | Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang
This paper introduces the Multi-dimensional Human Preference (MHP) dataset and the Multi-dimensional Preference Score (MPS) model for evaluating text-to-image generation. Current metrics for text-to-image models rely on statistical measures that poorly reflect human preferences, and existing preference models typically reduce complex human judgments to a single score, failing to capture their multi-dimensional nature. To address this, the MHP dataset provides 918,315 human preference choices across four dimensions (aesthetics, semantic alignment, detail quality, and overall assessment) on 607,541 images generated by a variety of text-to-image models.

The MPS model learns to predict these multi-dimensional preferences by incorporating a condition mask that highlights the prompt words relevant to each preference condition. It uses a pre-trained vision-language model (e.g., CLIP) to extract image and prompt features, fuses them with a cross-attention mechanism, and applies the condition mask so that scoring focuses on condition-relevant words, allowing a single model to predict preference scores under different conditions. MPS outperforms existing methods on three datasets in predicting both overall and dimension-specific preferences, demonstrating strong generalization. The MHP dataset and MPS model are publicly released to support future research, offering a more accurate and comprehensive metric for evaluating and improving text-to-image models.
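The sketch below illustrates the scoring pipeline described above: image tokens attend to prompt tokens via cross-attention, with a per-condition mask restricting attention to relevant words before a scalar preference score is produced. This is a minimal illustration, not the authors' implementation; the feature dimensions, the single attention layer, the mean pooling, and the class name MPSSketch are assumptions made for clarity, whereas the real model is built on a pre-trained CLIP backbone and trained on the MHP preference data.

import torch
import torch.nn as nn

class MPSSketch(nn.Module):
    """Toy multi-dimensional preference scorer (illustrative only)."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Stand-ins for projections on top of frozen CLIP image/text features.
        self.image_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)
        # Cross-attention: image tokens (queries) attend to prompt tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, image_tokens, text_tokens, condition_mask):
        # image_tokens:   (B, Ni, dim) patch features from the vision encoder
        # text_tokens:    (B, Nt, dim) token features from the text encoder
        # condition_mask: (B, Nt) bool; True marks prompt words relevant to the
        #                 chosen dimension (aesthetics, alignment, detail, overall)
        q = self.image_proj(image_tokens)
        kv = self.text_proj(text_tokens)
        # key_padding_mask expects True for positions to IGNORE, so invert the mask.
        fused, _ = self.cross_attn(q, kv, kv, key_padding_mask=~condition_mask)
        # Pool the fused features and map them to a scalar preference score.
        return self.score_head(fused.mean(dim=1)).squeeze(-1)

# Example usage: higher score = preferred under the given condition; in practice,
# pairs of images for the same prompt would be trained with a ranking-style loss.
model = MPSSketch()
img = torch.randn(2, 16, 512)               # two images, 16 patch tokens each
txt = torch.randn(2, 10, 512)               # shared prompt, 10 tokens
mask = torch.ones(2, 10, dtype=torch.bool)  # "overall" condition keeps all words
scores = model(img, txt, mask)
print(scores.shape)  # torch.Size([2])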