23 May 2024 | Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang
The paper "Learning Multi-dimensional Human Preference for Text-to-Image Generation" by Sixian Zhang et al. addresses the limitations of current metrics for evaluating text-to-image models, which often fail to capture the nuanced and multi-dimensional preferences of humans. The authors propose the Multi-dimensional Preference Score (MPS), a novel model designed to learn and evaluate these preferences. The MPS is trained on the Multi-dimensional Human Preference (MHP) dataset, which includes 918,315 human preference choices across four dimensions: aesthetics, semantic alignment, detail quality, and overall assessment. The MHP dataset is the largest of its kind, comprising 607,541 images generated by various text-to-image models. The MPS model uses a CLIP model and introduces a preference condition module to learn diverse preferences, incorporating a condition mask that ensures the model focuses on relevant parts of the prompt and image. Experimental results show that the MPS outperforms existing methods in predicting multi-dimensional human preferences across three datasets, demonstrating its effectiveness and generalization capabilities. The authors also introduce the MPS benchmark, which can be used to evaluate text-to-image models based on multiple dimensions of human preference.The paper "Learning Multi-dimensional Human Preference for Text-to-Image Generation" by Sixian Zhang et al. addresses the limitations of current metrics for evaluating text-to-image models, which often fail to capture the nuanced and multi-dimensional preferences of humans. The authors propose the Multi-dimensional Preference Score (MPS), a novel model designed to learn and evaluate these preferences. The MPS is trained on the Multi-dimensional Human Preference (MHP) dataset, which includes 918,315 human preference choices across four dimensions: aesthetics, semantic alignment, detail quality, and overall assessment. The MHP dataset is the largest of its kind, comprising 607,541 images generated by various text-to-image models. The MPS model uses a CLIP model and introduces a preference condition module to learn diverse preferences, incorporating a condition mask that ensures the model focuses on relevant parts of the prompt and image. Experimental results show that the MPS outperforms existing methods in predicting multi-dimensional human preferences across three datasets, demonstrating its effectiveness and generalization capabilities. The authors also introduce the MPS benchmark, which can be used to evaluate text-to-image models based on multiple dimensions of human preference.