PROMETHEUS-VISION: Vision-Language Model as a Judge for Fine-Grained Evaluation

12 Jan 2024 | Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, Minjoon Seo
The paper introduces PROMETHEUS-VISION, an open-source Vision-Language Model (VLM) evaluator designed to assess the quality of long-form responses generated by VLMs. Evaluating VLMs is challenging because it requires checking not only whether the model follows the given instruction but also whether the text output is grounded in the provided image. To address this, the authors propose using VLMs to evaluate other VLMs, inspired by the 'LM-as-a-Judge' paradigm.

They construct a new feedback dataset, PERCEPTION COLLECTION, which includes 15K customized score rubrics of the kind users might apply during assessment. Using this dataset, they fine-tune LLaVA-1.5 to obtain PROMETHEUS-VISION, a VLM evaluator that can score responses against user-defined criteria. PROMETHEUS-VISION shows high Pearson correlation with both human evaluators and GPT-4V, demonstrating its effectiveness as a transparent and accessible evaluator of VLMs. The authors open-source their code, dataset, and model at <https://github.com/kaistAI/prometheus-vision>.

The paper also discusses limitations of PROMETHEUS-VISION, such as weaker performance on text-rich images and the need for more diverse training data. Overall, the work contributes a flexible, automatic text-evaluation method and a comprehensive multi-modal feedback dataset.
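To make the rubric-based setup concrete, the sketch below assembles an evaluation prompt from a hypothetical user-defined score rubric and measures judge-human agreement with the Pearson correlation the paper reports. The rubric text, the helper `build_judge_prompt`, and the prompt layout are illustrative assumptions in the spirit of the paper, not PROMETHEUS-VISION's exact format, and the scores are placeholders.

```python
# Minimal sketch of rubric-based VLM judging, under assumed names and formats.
from scipy.stats import pearsonr

# Hypothetical user-defined rubric: one criterion plus a description per score.
RUBRIC = {
    "criteria": "Does the response accurately describe the objects and their "
                "spatial relationships in the image?",
    "score1": "The response ignores the image entirely.",
    "score2": "The response mentions objects absent from the image.",
    "score3": "The response names the main objects but misstates relationships.",
    "score4": "The response is mostly grounded, with minor omissions.",
    "score5": "The response is fully grounded in the image.",
}

def build_judge_prompt(instruction: str, response: str, rubric: dict) -> str:
    """Assemble an evaluation prompt pairing the instruction and response with
    the user-defined rubric; the image itself is passed to the VLM separately."""
    levels = "\n".join(f"Score {i}: {rubric[f'score{i}']}" for i in range(1, 6))
    return (
        f"###Instruction: {instruction}\n"
        f"###Response to evaluate: {response}\n"
        f"###Score rubric: {rubric['criteria']}\n{levels}\n"
        "###Feedback: write feedback, then output an integer score from 1 to 5."
    )

# Agreement with human raters, computed as a Pearson correlation over
# placeholder annotations and placeholder VLM-judge outputs.
human_scores = [5, 3, 4, 2, 5, 1]
judge_scores = [5, 3, 5, 2, 4, 1]
r, p_value = pearsonr(human_scores, judge_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```

In practice the assembled prompt and image would be fed to the fine-tuned LLaVA-1.5 judge, and the parsed integer scores would take the place of `judge_scores` above.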