Towards Open-ended Visual Quality Comparison

4 Mar 2024 | Haoning Wu*, Hanwei Zhu*, Zicheng Zhang*, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, Xiaohong Liu, Guangtao Zhai, Shiqi Wang, and Weisi Lin
This paper introduces Co-Instruct, the first open-source large multi-modal model (LMM) capable of open-ended visual quality comparison. The model is trained on the Co-Instruct-562K dataset, which is constructed using two methods: Merge2Compare, which merges human-labeled single-image quality descriptions into comparative texts, and Teach2Compare, which collects GPT-4V comparison responses on unlabeled image pairs and groups. Co-Instruct outperforms existing LMMs on both open-ended quality comparison tasks and existing benchmarks, achieving 30% higher accuracy than state-of-the-art open-source LMMs and surpassing GPT-4V on multiple benchmarks.

The paper also introduces MICBench, the first benchmark for multi-image quality comparison, containing 2,000 multiple-choice questions that compare quality or related attributes among three or four images. To handle multi-image inputs, the model is trained with an image-text interleaved format and uses a visual abstractor to reduce the visual token length per image. Results show that Co-Instruct substantially improves the ability of open-source LMMs on multi-image comparison and matches GPT-4V in scenarios requiring detailed language reasoning. Ablation studies further show that combining data from the different sources enhances performance. Overall, the work advances visual quality comparison by providing both a new model and a new benchmark for the open-ended setting.
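To make the image-text interleaved input format more concrete, below is a minimal sketch of how a multi-image quality-comparison prompt might be assembled. The placeholder token `<|image|>`, the helper names, and the exact wording of the template are illustrative assumptions, not Co-Instruct's verbatim format.

```python
# Sketch: building an image-text interleaved prompt for multi-image
# quality comparison. The "<|image|>" placeholder and template wording
# are assumptions for illustration only.

from typing import List


def ordinal(n: int) -> str:
    # Small helper covering the 2-4 image settings used in MICBench-style questions.
    names = {1: "first", 2: "second", 3: "third", 4: "fourth"}
    return names.get(n, f"{n}th")


def build_interleaved_prompt(image_paths: List[str], question: str) -> str:
    """Interleave an image placeholder with a short textual label for each
    image, then append the comparison question, so the model can ground
    references like "the first image" or "the third image"."""
    parts = []
    for idx, _path in enumerate(image_paths, start=1):
        parts.append(f"The {ordinal(idx)} image: <|image|>")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_interleaved_prompt(
        ["img_a.jpg", "img_b.jpg", "img_c.jpg"],
        "Which image has the best overall quality? "
        "(A) the first image (B) the second image (C) the third image",
    )
    print(prompt)
```

In an actual pipeline, each `<|image|>` placeholder would be replaced by the visual tokens produced by the vision encoder and visual abstractor for the corresponding image; the sketch only shows how the text side of the interleaved sequence could be laid out.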