Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

2024-07-15 | Tu Vu, Kalpesh Krishna, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, Yun-Hsuan Sung
This paper introduces FLAMe, a family of foundational large language model (LLM) autoraters trained on a diverse collection of 100+ quality assessment tasks comprising over 5 million human judgments. FLAMe is trained exclusively on publicly available, permissively licensed data from prior research, making it a reliable and transparent basis for automatic evaluation. The model is designed to generalize across a wide range of tasks and outperforms proprietary LLM-as-a-Judge models such as GPT-4 and Claude-3 on many benchmarks.

FLAMe-RM, a variant fine-tuned for reward modeling evaluation, achieves 87.8% accuracy on the RewardBench benchmark, outperforming both GPT-4-0125 and GPT-4o. FLAMe-Opt-RM, a more computationally efficient variant, achieves competitive performance with significantly less training data. Overall, the FLAMe variants outperform popular proprietary LLM-as-a-Judge models on 8 of 12 autorater evaluation benchmarks covering 53 quality assessment tasks, including RewardBench and LLM-AggreFact. FLAMe also exhibits significantly less bias than these models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.

The paper also discusses FLAMe's limitations, including potential performance issues on multilingual and long-context tasks, and suggests future work on expanding the data collection and exploring alternative training approaches. Ethical considerations and risks associated with LLM autoraters are addressed as well, emphasizing the need for transparency, bias audits, and diverse perspectives to ensure fairness and accountability.
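To make the LLM-as-a-Judge setup concrete, below is a minimal sketch of pairwise autorater evaluation, the kind of preference task that RewardBench measures. The prompt template and the call_model function are hypothetical placeholders, not FLAMe's actual prompt format or API; a real implementation would route the formatted prompt to an autorater such as FLAMe and parse its preference.

```python
# Minimal sketch of pairwise LLM-as-a-Judge evaluation.
# call_model is a hypothetical stand-in: replace it with a real query
# to an autorater (e.g., FLAMe). This is NOT FLAMe's documented API.

PAIRWISE_TEMPLATE = """You are evaluating two responses to the same prompt.

Prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Answer with exactly "A" or "B"."""


def call_model(judge_prompt: str) -> str:
    # Hypothetical placeholder; a real implementation would query an LLM
    # autorater and return its raw text output.
    return "A"


def judge_pair(prompt: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' according to the autorater's preference."""
    judgment = call_model(
        PAIRWISE_TEMPLATE.format(
            prompt=prompt, response_a=response_a, response_b=response_b
        )
    ).strip()
    if judgment not in ("A", "B"):
        raise ValueError(f"Unparseable judgment: {judgment!r}")
    return judgment


if __name__ == "__main__":
    winner = judge_pair(
        "Explain what an autorater is in one sentence.",
        "An autorater is a model that scores or ranks other models' outputs.",
        "It is a thing.",
    )
    print(f"Preferred response: {winner}")
```

On a reward modeling benchmark of (chosen, rejected) response pairs, accuracy is then simply the fraction of pairs for which the autorater prefers the human-chosen response.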