5 Jul 2024 | Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
**MJ-BENCH: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?**
This paper addresses the challenges faced by text-to-image models, such as hallucination, bias, and the production of unsafe, low-quality output. To address these issues, the authors introduce MJ-BENCH, a novel benchmark that evaluates multimodal judges on their feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. The benchmark includes a comprehensive preference dataset and evaluates a wide range of multimodal judges, including CLIP-based scoring models, open-source VLMs, and closed-source VLMs.
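To illustrate how such a preference dataset can be used to evaluate a scoring-model judge, the minimal sketch below computes preference accuracy: the fraction of (prompt, chosen image, rejected image) triples on which the judge scores the chosen image higher than the rejected one. The CLIP checkpoint name and the data layout are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch: preference-accuracy evaluation of a CLIP-style scoring judge.
# Checkpoint and dataset structure are illustrative, not the paper's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image: Image.Image) -> float:
    """Text-image alignment score from CLIP's image-text similarity logits."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_image.item()

def preference_accuracy(triples) -> float:
    """triples: iterable of (prompt, chosen PIL image, rejected PIL image).
    Returns the fraction of pairs where the judge prefers the chosen image."""
    correct = sum(
        clip_score(prompt, chosen) > clip_score(prompt, rejected)
        for prompt, chosen, rejected in triples
    )
    return correct / len(triples)
```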
Key findings from the evaluation include:
1. **Closed-source VLMs** generally provide better feedback, with GPT-4o outperforming other judges on average.
2. **Smaller-sized scoring models** are better at providing feedback on text-image alignment and image quality.
3. **VLMs** are more accurate in providing feedback on safety and generation bias due to their stronger reasoning capabilities.
4. **VLM judges** generally provide more accurate and stable feedback when asked to rate in natural language (Likert scale) than when asked for a numerical score (see the sketch after this list).
5. Human evaluations of models fine-tuned end-to-end with feedback from each multimodal judge separately confirm the conclusions of MJ-BENCH.
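As a concrete illustration of finding 4, the sketch below contrasts the two feedback formats: a Likert-scale prompt with verbal labels and a numerical-scale prompt. The prompt wording and the `query_vlm` helper are hypothetical stand-ins for whatever VLM API is being benchmarked, not the paper's actual templates.

```python
# Hedged sketch of the two feedback formats compared in finding 4.
# Prompt wording and the query_vlm callable are hypothetical placeholders.
LIKERT_LABELS = ["Extremely Poor", "Poor", "Average", "Good", "Outstanding"]

def likert_prompt(instruction: str) -> str:
    # Ask the judge to answer with a verbal label instead of a raw number.
    return (
        f"You are given an image generated from the prompt: '{instruction}'.\n"
        f"Rate how well the image matches the prompt using one of these labels: "
        f"{', '.join(LIKERT_LABELS)}. Answer with the label only."
    )

def numeric_prompt(instruction: str) -> str:
    # Numerical-scale variant for comparison.
    return (
        f"You are given an image generated from the prompt: '{instruction}'.\n"
        f"Rate how well the image matches the prompt on a scale from 1 to 10. "
        f"Answer with a single number."
    )

def likert_rating(instruction: str, image_path: str, query_vlm) -> int:
    """Map the judge's verbal answer to a comparable 0-4 rating.
    query_vlm(prompt, image_path) -> str is a user-supplied VLM wrapper."""
    answer = query_vlm(likert_prompt(instruction), image_path).strip()
    return LIKERT_LABELS.index(answer) if answer in LIKERT_LABELS else -1
```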
The paper also discusses the design philosophy and construction of the MJ-BENCH dataset, which includes detailed curation processes for each perspective (alignment, safety, quality, and bias). The evaluation metrics used to assess the judges' performance are explained, and the results are presented in tables and figures. The authors conclude by highlighting the importance of understanding the capabilities and limitations of multimodal judges to improve the reliability and alignment of text-to-image generation models.