MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

25 Jun 2024 | Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, Wanli Ouyang
MT-Bench-101 is a benchmark designed to evaluate the fine-grained abilities of large language models (LLMs) in multi-turn dialogues. It is built on a three-tier hierarchical ability taxonomy comprising 13 distinct tasks across 1,388 multi-turn dialogues, covering 4,208 turns in total. The taxonomy is derived from real multi-turn dialogue data and from teaching frameworks in educational psychology.

The benchmark evaluates 21 popular LLMs (2 closed-source and 19 open-source) across these tasks. GPT-4 performs best, with an average score of 8.86. Larger models generally perform better, but neither common alignment techniques nor chat-specific designs lead to significant improvements in multi-turn abilities. Performance also varies across tasks and dialogue turns, with some tasks showing a decline as the dialogue progresses, indicating that the tasks in MT-Bench-101 effectively measure the multi-turn chat abilities of LLMs.

Evaluation uses GPT-4 as a human-like judge: each turn of a dialogue is scored, and the lowest score across all turns is taken as the final score for that dialogue. Case studies and human evaluations validate the effectiveness of this evaluation method, and the results demonstrate that MT-Bench-101 provides a reliable and comprehensive assessment of LLMs in multi-turn dialogues. The data and code are available at https://github.com/mtbench101/mt-bench-101.
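The dialogue-level aggregation can be made concrete with a small sketch. The snippet below is a minimal illustration, assuming each turn has already been scored on a 1-10 scale by the GPT-4 judge; the per-turn judging call and the exact averaging across dialogues are assumptions for illustration, not the benchmark's actual implementation.

```python
from statistics import mean

def dialogue_score(turn_scores: list[float]) -> float:
    """A dialogue's final score is its lowest per-turn score,
    so a single weak turn caps the whole dialogue."""
    return min(turn_scores)

def task_score(dialogues: list[list[float]]) -> float:
    """Assumed aggregation: average the dialogue-level scores within a task."""
    return mean(dialogue_score(d) for d in dialogues)

# Hypothetical per-turn judge scores for two dialogues from one task.
dialogues = [
    [9.0, 8.0, 7.0],  # weakest turn scored 7 -> dialogue score 7.0
    [10.0, 6.0],      # weakest turn scored 6 -> dialogue score 6.0
]
print(task_score(dialogues))  # 6.5
```

Taking the minimum rather than the mean over turns penalizes models whose quality degrades later in a conversation, which is consistent with the paper's observation that some tasks show declining performance as the dialogue progresses.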