PROMETHEUS-VISION: Vision-Language Model as a Judge for Fine-Grained Evaluation

12 Jan 2024 | Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, Minjoon Seo
PROMETHEUS-VISION is a vision-language model (VLM) designed to evaluate the quality of long-form responses generated by other VLMs. It is trained on the PERCEPTION COLLECTION, a new multimodal feedback dataset covering 15,000 fine-grained score rubrics of the kind users might care about during assessment. Each training instance pairs five input components (an image, an instruction, a response to evaluate, a customized score rubric, and a reference answer) with two output components (language feedback and a score decision). Fine-tuned on this data, PROMETHEUS-VISION generates language feedback and a score decision grounded in the given criteria. Among open-source models it shows the highest Pearson correlation with both human evaluators and GPT-4V, and because it is open-sourced, researchers and developers can adopt it for transparent and accessible evaluation of VLMs. Its ability to follow user-defined scoring criteria makes it a practical tool for assessing the quality of VLM outputs.
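To make the instance format concrete, here is a minimal sketch of what a single PERCEPTION COLLECTION record might look like. The field names and example values are hypothetical, chosen only to mirror the five input components and two output components listed above; the actual dataset schema may differ.

```python
# Hypothetical PERCEPTION COLLECTION instance. Field names and values are
# illustrative, mirroring the five inputs and two outputs described above.
instance = {
    # --- five input components ---
    "image": "images/0001.jpg",  # the image under discussion
    "instruction": "Describe the weather conditions shown in the photo.",
    "response_to_evaluate": "The sky is overcast and the streets look wet.",
    "score_rubric": {
        "criteria": "Does the response accurately describe the visual weather cues?",
        "score1": "Ignores or contradicts the visual evidence.",
        "score5": "Identifies all salient weather cues correctly.",
    },
    "reference_answer": "Heavy clouds and wet pavement indicate recent rain.",
    # --- two output components (the training targets) ---
    "language_feedback": "The response notes the overcast sky and wet streets "
                         "but does not connect them to recent rainfall.",
    "score_decision": 4,
}
```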
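Because the judge emits free-form feedback together with a score, using it programmatically requires a small parsing step. The sketch below assumes a Prometheus-style output in which the feedback and the score are separated by a [RESULT] tag; that delimiter is an assumption carried over from the original PROMETHEUS judge, so treat the format as illustrative rather than confirmed for this model.

```python
import re

def parse_judge_output(generation: str) -> tuple[str, int | None]:
    """Split a judge generation into (feedback, score).

    Assumes a Prometheus-style "feedback [RESULT] score" layout; the
    [RESULT] delimiter is an assumption, not confirmed for this model.
    """
    match = re.search(r"\[RESULT\]\s*([1-5])", generation)
    if match is None:
        # Malformed output: return the raw text and no score.
        return generation.strip(), None
    feedback = generation[: match.start()].strip()
    return feedback, int(match.group(1))

feedback, score = parse_judge_output(
    "The response captures the overcast sky but misses the wet road. [RESULT] 3"
)
print(score)  # -> 3
```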
The PERCEPTION COLLECTION is built through a multi-stage process: 50 seed score rubrics are written first, GPT-4V then expands them into 15,000 fine-grained score rubrics, and instructions and reference answers tied to each rubric are augmented next, along with the responses and language feedback used as training targets. The dataset is designed to capture the nuances of VLM outputs and to support detailed feedback on response quality. PROMETHEUS-VISION is evaluated on three families of benchmarks: visual instruction following, visual question answering, and captioning. Across these benchmarks it correlates strongly with human evaluators and with GPT-4V, indicating its effectiveness as a judge of VLM outputs. The model is also analyzed for potential biases: it exhibits neither length bias, since the distribution of response lengths is even across scores, nor self-enhancement bias, since it does not prefer its own responses over others. These findings suggest that PROMETHEUS-VISION is a reliable and effective tool for evaluating VLM outputs.
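The multi-stage construction can be summarized as a pipeline. The sketch below is an assumption-laden outline, not the authors' code: the gpt4v() helper stands in for whatever API client was actually used, every prompt string is a placeholder, and the 300-per-seed expansion is simply the arithmetic implied by 50 seeds yielding 15,000 rubrics.

```python
# Hypothetical outline of the PERCEPTION COLLECTION construction pipeline.
# gpt4v() and all prompt strings are placeholders, not the authors' code.

def gpt4v(prompt: str, image: str | None = None) -> str:
    """Placeholder for a GPT-4V API call."""
    raise NotImplementedError

def build_perception_collection(seed_rubrics: list[str], images: list[str]) -> list[dict]:
    # Stage 1: expand 50 hand-written seed rubrics into 15,000 fine-grained
    # ones (50 seeds x 300 variants = 15,000, per the numbers above).
    rubrics = [gpt4v(f"Write a fine-grained rubric inspired by: {seed}")
               for seed in seed_rubrics for _ in range(300)]
    dataset = []
    for rubric, image in zip(rubrics, images):
        # Stage 2: augment an instruction and reference answer for the rubric.
        instruction = gpt4v(f"Write an instruction assessable by: {rubric}", image)
        reference = gpt4v(f"Answer the instruction at the top score of: {rubric}", image)
        # Stage 3: generate a response to evaluate, then feedback on it.
        response = gpt4v(instruction, image)
        feedback = gpt4v(f"Critique this response under {rubric}: {response}", image)
        dataset.append({
            "image": image,
            "instruction": instruction,
            "response_to_evaluate": response,
            "score_rubric": rubric,
            "reference_answer": reference,
            "language_feedback": feedback,  # score parsing omitted in this sketch
        })
    return dataset
```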
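The headline evaluation metric is Pearson correlation between the judge's scores and reference scores. As a quick illustration of the metric itself (not the paper's evaluation harness), it can be computed with SciPy; the score arrays below are made up.

```python
from scipy.stats import pearsonr

# Made-up example: scores a VLM judge assigned vs. scores human
# evaluators assigned to the same ten responses (1-5 scale).
judge_scores = [4, 2, 5, 3, 3, 1, 4, 5, 2, 3]
human_scores = [4, 2, 4, 3, 2, 1, 5, 5, 2, 3]

r, p_value = pearsonr(judge_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
# A judge that tracks human judgment closely yields r near 1.0.
```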
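The length-bias analysis can likewise be made concrete: a length-biased judge systematically assigns higher scores to longer responses. A minimal self-contained check, with fabricated data rather than the paper's, groups response lengths by assigned score and compares their means; an even distribution across scores, as reported above, is the unbiased outcome.

```python
from collections import defaultdict
from statistics import mean

# Fabricated (score, response) pairs standing in for judge outputs.
judged = [
    (2, "A short answer."),
    (4, "A short but accurate answer."),
    (5, "A long, detailed, and accurate answer about the image content."),
    (3, "A long, detailed, but partly inaccurate answer about the image."),
    (1, "A long, rambling answer that never addresses the instruction at all."),
]

lengths_by_score = defaultdict(list)
for score, response in judged:
    lengths_by_score[score].append(len(response.split()))

# If the judge is unbiased, mean length should not climb with score.
for score in sorted(lengths_by_score):
    print(score, mean(lengths_by_score[score]))
```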