PROMETHEUS 2: An Open Source Language Model Specialized in Evaluating Other Language Models


2 May 2024 | Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
**Abstract:** Proprietary LMs such as GPT-4 are commonly used to assess the quality of responses from various LMs. However, concerns about transparency, controllability, and affordability have led to the development of open-source LMs specialized in evaluation. Existing open evaluator LMs have significant shortcomings: they produce scores that diverge from human judgments, and they lack the flexibility to handle both direct assessment and pairwise ranking, the two most common forms of evaluation. Additionally, they cannot evaluate based on custom criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM that closely mirrors human and GPT-4 judgments. It can process both direct assessment and pairwise ranking formats with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 achieves the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are publicly available.

**Introduction:** Evaluating the quality of outputs from language models (LMs) is becoming increasingly challenging because the generated text is diverse and the underlying tasks are complex. Language model-based evaluation has emerged as a scalable and cost-effective paradigm for assessing LM-generated text: prior work using proprietary LMs as evaluators has demonstrated high correlations with human evaluations along with greater speed and cost-effectiveness. However, relying on proprietary LMs poses significant challenges, including a lack of transparency, fairness, and compliance.

Recent work has therefore focused on developing open-access, transparent, and controllable evaluator LMs. Yet these models often yield scores that correlate poorly with human judgments or proprietary LMs, failing to effectively simulate them. Moreover, open evaluator LMs are not flexible: they are typically trained for only one of direct assessment or pairwise ranking, limiting their ability to handle diverse real-life scenarios.

To bridge this gap, we investigate unifying the two model-based evaluation paradigms, direct assessment and pairwise ranking, to train a robust unified evaluator LM. We propose a recipe based on merging the weights of two evaluator LMs trained separately on the direct assessment and pairwise ranking formats. Our key empirical observation is that weight merging can yield an evaluator LM that not only works in both formats but also outperforms evaluator LMs that are jointly trained or trained on only a single format.

**Contributions:**
- We introduce Prometheus 2 (7B & 8x7B), state-of-the-art open evaluator LMs that achieve high correlations with both human evaluators and proprietary LM-based judges on both direct assessment and pairwise ranking.
- We introduce a training recipe that merges the weights of two evaluator LMs trained separately on direct assessment and pairwise ranking, yielding a unified evaluator that outperforms models jointly trained on both formats or trained on only one.
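To make the two evaluation formats concrete, the sketch below shows simplified prompt templates for direct assessment (grade a single response on a 1-5 scale against a user-defined rubric) and pairwise ranking (pick the better of two responses under the same criteria). These templates are illustrative assumptions, not the exact prompts used by Prometheus 2, which are available in the authors' released code.

```python
# Simplified, hypothetical prompt templates illustrating the two evaluation
# formats; the real Prometheus 2 prompts differ in wording and structure.

DIRECT_ASSESSMENT_TEMPLATE = """\
### Instruction:
{instruction}

### Response to evaluate:
{response}

### Score rubric (user-defined criteria):
{rubric}

Write brief feedback, then output a score from 1 to 5 as "[RESULT] <score>".
"""

PAIRWISE_RANKING_TEMPLATE = """\
### Instruction:
{instruction}

### Response A:
{response_a}

### Response B:
{response_b}

### Evaluation criteria (user-defined):
{rubric}

Write brief feedback comparing the two responses, then output "[RESULT] A" or "[RESULT] B".
"""

# Example: a custom criterion rather than a generic attribute like helpfulness.
prompt = DIRECT_ASSESSMENT_TEMPLATE.format(
    instruction="Explain the difference between precision and recall.",
    response="Precision is the fraction of retrieved items that are relevant...",
    rubric="Does the explanation use correct definitions and give a concrete example?",
)
print(prompt)
```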
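The proposed recipe merges the weights of two evaluator LMs fine-tuned from the same base model, one on direct assessment data and one on pairwise ranking data. Below is a minimal sketch of linear weight merging in PyTorch; the checkpoint paths and the interpolation weight `ALPHA` are illustrative assumptions rather than the paper's exact configuration, and linear interpolation is only one of several possible merging strategies.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint paths for the two single-format evaluators; both are
# assumed to be fine-tuned from the same base model so their parameters align.
DIRECT_ASSESSMENT_CKPT = "path/to/evaluator-direct-assessment"
PAIRWISE_RANKING_CKPT = "path/to/evaluator-pairwise-ranking"
ALPHA = 0.5  # interpolation coefficient; an assumed value, not a tuned setting


def merge_linear(ckpt_a: str, ckpt_b: str, alpha: float):
    """Linearly interpolate two fine-tuned evaluators:
    theta_merged = alpha * theta_a + (1 - alpha) * theta_b."""
    model_a = AutoModelForCausalLM.from_pretrained(ckpt_a, torch_dtype=torch.bfloat16)
    model_b = AutoModelForCausalLM.from_pretrained(ckpt_b, torch_dtype=torch.bfloat16)

    state_a = model_a.state_dict()
    state_b = model_b.state_dict()
    merged = {}
    for name, param_a in state_a.items():
        param_b = state_b[name]
        if param_a.dtype.is_floating_point:
            merged[name] = alpha * param_a + (1.0 - alpha) * param_b
        else:
            # Non-float buffers (if any) are copied from the first model as-is.
            merged[name] = param_a

    # Load the merged weights back into one of the models and reuse it.
    model_a.load_state_dict(merged)
    return model_a


if __name__ == "__main__":
    merged_evaluator = merge_linear(DIRECT_ASSESSMENT_CKPT, PAIRWISE_RANKING_CKPT, ALPHA)
    merged_evaluator.save_pretrained("merged-evaluator")
```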