This paper addresses the challenge of open-ended visual quality comparison, aiming to develop a model that can respond to open-range questions and provide detailed reasoning about the quality of multiple images. The authors propose Co-Instruct, a novel large multi-modality model (LMM) designed for this purpose. To train Co-Instruct, they collect the Co-Instruct-562K dataset, which combines two sources: (1) single-image quality descriptions merged by an LLM into comparative text, and (2) GPT-4V "teacher" responses on unlabeled data; it is the first dataset of its kind for open-ended visual quality comparison. The authors also introduce MICBench, a benchmark specifically designed for evaluating LMMs on multi-image quality comparison. Co-Instruct outperforms existing LMMs on both the proposed MICBench and existing quality evaluation benchmarks, achieving up to 30% higher accuracy than state-of-the-art models and surpassing GPT-4V on various multiple-choice question (MCQ) benchmarks. The paper contributes to the field by advancing the capabilities of LMMs in open-ended visual quality comparison and providing a comprehensive benchmark for future research.
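
To make the data-construction step more concrete, below is a minimal sketch of how single-image quality descriptions might be merged into a comparative training sample via an LLM prompt. The function names (`build_merge_prompt`, `call_llm`), the prompt wording, and the output format are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch (assumptions, not the paper's pipeline): merge per-image
# quality descriptions into a single comparative instruction-response pair.
from typing import Callable, Dict, List


def build_merge_prompt(descriptions: List[str]) -> str:
    """Assemble a prompt asking an LLM to compare images from their descriptions."""
    numbered = "\n".join(
        f"Image {i + 1}: {desc}" for i, desc in enumerate(descriptions)
    )
    return (
        "Below are quality descriptions of several images.\n"
        f"{numbered}\n"
        "Write a comparative answer discussing which image has the best quality "
        "and why, reasoning over sharpness, noise, color, and composition."
    )


def merge_descriptions(
    descriptions: List[str], call_llm: Callable[[str], str]
) -> Dict[str, str]:
    """Produce one (instruction, response) pair; `call_llm` is any text-in/text-out LLM wrapper."""
    prompt = build_merge_prompt(descriptions)
    return {
        "instruction": "Compare the quality of the given images in detail.",
        "response": call_llm(prompt),
    }


if __name__ == "__main__":
    # Toy stand-in for a real LLM call, just to show the data flow.
    fake_llm = lambda p: "Image 1 is sharper and less noisy, so its quality is higher."
    sample = merge_descriptions(
        ["Sharp, well-exposed photo with mild noise.",
         "Blurry photo with strong compression artifacts."],
        fake_llm,
    )
    print(sample["response"])
```

In this reading, the LLM only sees textual descriptions, so the merging step requires no image inputs; the GPT-4V "teacher" responses would supply the genuinely multi-image supervision.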