Understanding DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

DreamBench++ is a human-aligned benchmark for personalized image generation. It evaluates the two key capabilities of personalized image generation models: concept preservation (how faithfully a generated image keeps the subject of the reference image) and prompt following (how closely it matches the text prompt). Its dataset contains 150 images and 1,350 prompts, which makes it significantly larger and more diverse than existing benchmarks such as DreamBench. The images span several difficulty levels and categories, including animals, styles, and human subjects.

Evaluation is automated with advanced multimodal GPT models such as GPT-4o. The evaluation prompts given to GPT-4o are systematically designed so that its judgments are both human-aligned and self-aligned. As a result, DreamBench++ agrees with human evaluations far more closely than traditional metrics such as DINO and CLIP, reaching 79.64% agreement on concept preservation and 93.18% on prompt following, and it gives more accurate and comprehensive evaluations of personalized image generation models. The framework is robust and transfers to other domains and foundation models.

Using the benchmark, the authors evaluate 7 modern generative models. DreamBooth achieves the best overall performance, preserving detailed visual features while adhering closely to the text prompts. The results expose the strengths and weaknesses of each model in concept preservation and prompt following, and they also offer insights into prompt design for advanced multimodal GPTs. The benchmark is open-sourced to encourage further innovation in the research community.
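The two evaluation axes can be made concrete with a small scoring sketch. The Python below is a minimal illustration, not the benchmark's released code: it assumes that the automated judge returns one integer rating per generated image for each axis on a hypothetical 0 to 4 scale, normalizes each rating to [0, 1], averages per axis, and multiplies the two averages into one combined score. The scale, the product rule, and the names `Rating`, `aggregate`, and `MAX_RATING` are all assumptions. The only figures taken from the summary are the dataset sizes, which work out to 1,350 / 150 = 9 prompts per image on average.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical rating scale for the automated judge (assumption, not from the paper).
MAX_RATING = 4

# Dataset sizes stated in the summary.
NUM_IMAGES = 150
NUM_PROMPTS = 1350
PROMPTS_PER_IMAGE = NUM_PROMPTS // NUM_IMAGES  # 1,350 / 150 = 9 on average


@dataclass
class Rating:
    """One judge verdict for a single generated image (hypothetical schema)."""
    concept_preservation: int  # how well the reference subject is kept
    prompt_following: int      # how well the text prompt is followed


def aggregate(ratings: list[Rating]) -> dict[str, float]:
    """Normalize each rating to [0, 1], average per axis, and combine by product."""
    if not ratings:
        raise ValueError("need at least one rating")
    for r in ratings:
        for value in (r.concept_preservation, r.prompt_following):
            if not 0 <= value <= MAX_RATING:
                raise ValueError(f"rating {value} outside 0..{MAX_RATING}")
    cp = mean(r.concept_preservation / MAX_RATING for r in ratings)
    pf = mean(r.prompt_following / MAX_RATING for r in ratings)
    # The product only stays high when a model does well on BOTH axes.
    return {"cp": cp, "pf": pf, "cp_x_pf": cp * pf}


if __name__ == "__main__":
    print(PROMPTS_PER_IMAGE)  # 9
    demo = [Rating(4, 2), Rating(2, 4), Rating(3, 3), Rating(3, 3)]
    print(aggregate(demo))  # {'cp': 0.75, 'pf': 0.75, 'cp_x_pf': 0.5625}
```

Multiplying the axes means a model cannot score well overall by excelling at only one of them: a model that copies the reference image verbatim would preserve the concept but ignore the prompt. The summary does not state which combination rule DreamBench++ itself uses.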
The benchmark is a valuable resource for researchers in the field of personalized image generation.
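The agreement figures quoted above (79.64% and 93.18%) measure how closely automated scores track human judgments, but the summary does not define the statistic behind them. As a hedged illustration of what such a human-alignment check involves, the sketch below uses pairwise ranking agreement: over all pairs of samples that humans rate differently, it counts how often a metric orders the pair the same way. The statistic, the `pairwise_agreement` function, and the toy ratings are assumptions for illustration, and the paper may use a different measure. Because only orderings matter, the same function can score both a discrete judge rating and a continuous similarity score (standing in for a DINO or CLIP score) against the same human labels.

```python
from itertools import combinations


def pairwise_agreement(metric: list[float], human: list[float]) -> float:
    """Fraction of sample pairs ranked the same way by the metric and by humans.

    Pairs that humans rate as tied are skipped; a metric tie on a pair that
    humans distinguish counts as a disagreement.
    """
    if len(metric) != len(human):
        raise ValueError("metric and human ratings must align sample by sample")
    agree = total = 0
    for i, j in combinations(range(len(human)), 2):
        human_diff = human[i] - human[j]
        if human_diff == 0:
            continue  # humans express no preference, so there is nothing to check
        total += 1
        metric_diff = metric[i] - metric[j]
        if metric_diff * human_diff > 0:  # same sign means same ordering
            agree += 1
    if total == 0:
        raise ValueError("need at least one pair with differing human ratings")
    return agree / total


# Toy data (hypothetical): human ratings, an automated judge's ratings,
# and a continuous similarity score standing in for DINO or CLIP.
human = [4, 3, 1, 0]
judge = [4, 2, 1, 1]
similarity = [0.80, 0.85, 0.60, 0.70]

print(pairwise_agreement(judge, human))       # 0.8333... (5 of 6 pairs)
print(pairwise_agreement(similarity, human))  # 0.6666... (4 of 6 pairs)
```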

DREAMBENCH++: A Human-Aligned Benchmark for Personalized Image Generation

24 Jun 2024 | Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunru Han, Zheng Ge, Xiangyu Zhang, Shu-Tao Xia