PROMETHEUS 2: An Open Source Language Model Specialized in Evaluating Other Language Models

2 May 2024 | Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
Prometheus 2 is an open-source language model specialized in evaluating other language models. It addresses two limitations of existing open-source evaluators: weak correlation with human and proprietary-LM judgments, and support for only one evaluation scheme. Prometheus 2 handles both direct assessment (assigning a score to a single response) and pairwise ranking (choosing the better of two responses). It is trained by merging the weights of two evaluator models fine-tuned separately on the direct assessment and pairwise ranking formats, yielding a single unified evaluator that performs well under both schemes. Across four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 outperforms existing open evaluator models, achieving the highest correlation with human evaluators and proprietary LM judges. The model is publicly available, and its training data includes 1,000 custom evaluation criteria. The study also examines why weight merging improves evaluation performance, showing that merging models trained on different formats produces a more robust evaluator than training on either format alone, and it underscores the value of open-source models for fair and accessible evaluation, reducing reliance on proprietary judges.
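The merging step described above can be illustrated with a simple linear interpolation of model weights. The sketch below is a minimal illustration, not the authors' released code: it assumes the two fine-tuned evaluators share an identical architecture, and the checkpoint file names and the mixing coefficient alpha are hypothetical placeholders.

    # Minimal sketch of linear weight merging between two evaluator models.
    # Assumptions (not from the paper's released code): both checkpoints
    # share the same architecture, and one coefficient alpha controls the mix.
    # Checkpoint paths and the alpha value are hypothetical placeholders.
    import torch

    def merge_state_dicts(sd_a, sd_b, alpha=0.5):
        """Return alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter."""
        assert sd_a.keys() == sd_b.keys(), "models must share the same parameters"
        return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

    # Hypothetical checkpoints: one fine-tuned for direct assessment,
    # one fine-tuned for pairwise ranking.
    direct_sd = torch.load("evaluator_direct_assessment.pt", map_location="cpu")
    pairwise_sd = torch.load("evaluator_pairwise_ranking.pt", map_location="cpu")

    merged_sd = merge_state_dicts(direct_sd, pairwise_sd, alpha=0.5)
    torch.save(merged_sd, "evaluator_merged.pt")

In practice the mixing coefficient would be chosen by validating the merged model on held-out evaluation benchmarks; the 0.5 used here is only an illustrative default.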