2024-07-15 | Tu Vu*, Kalpesh Krishna*, Salaheddin Alzubi, Chris Tar, Manaal Faruqui and Yun-Hsuan Sung
The paper introduces FLAME, a family of foundational Large Autorater Models designed to evaluate the outputs of large language models (LLMs). FLAME is trained on a diverse collection of over 100 quality assessment tasks comprising more than 5.3 million human judgments, curated and standardized from prior research. To address the challenges of human evaluation, such as subjectivity and high cost, the authors use large-scale multitask instruction tuning. This approach enables FLAME to generalize well to a wide range of tasks, outperforming LLMs trained on proprietary data, such as GPT-4 and Claude-3, on many benchmarks.
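The unified text-to-text format behind this kind of multitask instruction tuning can be illustrated with a short sketch. The field names, instruction wording, and rating scheme below are hypothetical and only show how a single human judgment might be rendered as an (input, target) training pair; the paper's actual task templates are not reproduced here.

```python
# Minimal sketch of casting a human quality judgment into a text-to-text
# instruction-tuning example. All field names and the instruction wording
# are hypothetical illustrations of the unified input/target format.

def to_text_to_text(task_definition: str, context: str, response: str, rating: str) -> dict:
    """Render one human judgment as an (input, target) training pair."""
    input_text = (
        f"{task_definition}\n\n"
        f"Context:\n{context}\n\n"
        f"Response:\n{response}\n\n"
        f"Evaluation:"
    )
    return {"input": input_text, "target": rating}

example = to_text_to_text(
    task_definition="Rate whether the response is fully supported by the context (Yes/No).",
    context="The Eiffel Tower is located in Paris, France.",
    response="The Eiffel Tower is in Berlin.",
    rating="No",
)
```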
FLAME is further fine-tuned for specific downstream applications, such as reward modeling evaluation (FLAME-RM), achieving top performance on the RewardBench benchmark with 87.8% accuracy and surpassing both GPT-4-0125 and GPT-4o. Additionally, the authors introduce a computationally efficient variant, FLAME-Opt-RM, which optimizes the multitask mixture for a target distribution using a novel tail-patch fine-tuning strategy, reaching competitive performance with significantly fewer training datapoints.
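As a rough illustration of the reward-modeling evaluation setting, the sketch below computes pairwise accuracy of the kind reported on RewardBench: the fraction of (prompt, chosen, rejected) triples for which an autorater prefers the human-chosen response. The `autorater_prefers_a` callable is a hypothetical stand-in for a call to a FLAME-style model, not the paper's actual evaluation harness.

```python
# Minimal sketch of pairwise reward-modeling accuracy (RewardBench-style).
# `autorater_prefers_a` is a hypothetical judge: True if response A is preferred.
from typing import Callable, Iterable

def pairwise_accuracy(
    pairs: Iterable[tuple[str, str, str]],                 # (prompt, chosen, rejected)
    autorater_prefers_a: Callable[[str, str, str], bool],  # judge(prompt, resp_a, resp_b)
) -> float:
    """Fraction of pairs where the autorater prefers the human-chosen response."""
    pairs = list(pairs)
    correct = sum(
        autorater_prefers_a(prompt, chosen, rejected)
        for prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)
```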
The paper also highlights that FLAME variants exhibit significantly less bias compared to popular LLM-as-a-Judge models on the CoBBLer autorater bias benchmark. Furthermore, FLAME effectively re-ranks LLM responses in code generation tasks, improving pass@1 accuracy by 6-10% across various settings.
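A minimal sketch of what such re-ranking might look like: for each problem, an autorater scores the sampled programs, the top-ranked one is submitted, and pass@1 is the fraction of problems whose submitted program passes the unit tests. The `score` and `passes_tests` callables are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch of autorater-based re-ranking for code generation and its
# effect on pass@1. `score` is a hypothetical autorater rating of a candidate
# program for a problem; `passes_tests` is a hypothetical unit-test oracle.
from typing import Callable, Sequence

def pass_at_1_with_reranking(
    problems: Sequence[str],
    samples: Sequence[Sequence[str]],           # candidate programs per problem
    score: Callable[[str, str], float],         # autorater score(problem, program)
    passes_tests: Callable[[str, str], bool],   # unit-test oracle(problem, program)
) -> float:
    """pass@1 when the top autorater-ranked sample is submitted per problem."""
    solved = 0
    for problem, candidates in zip(problems, samples):
        best = max(candidates, key=lambda program: score(problem, program))
        solved += passes_tests(problem, best)
    return solved / len(problems)
```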
Overall, FLAME outperforms all popular proprietary LLM-as-a-Judge models across 8 out of 12 autorater evaluation benchmarks, covering 53 quality assessment tasks, including RewardBench and LLM-AggreFact. The authors conclude by discussing limitations and future work, emphasizing the need for expanded data collections and exploration of alternative training approaches.