25 Jun 2024 | Ge Bai*†1, Jie Liu*‡2,3, Xingyuan Bu*†1, Yancheng He1, Jiaheng Liu1, Zanhui Zhou3, Zhuoran Lin1, Wenbo Su1, Tiezheng Ge1, Bo Zheng1, Wanli Ouyang2,3
The paper introduces MT-Bench-101, a comprehensive benchmark designed to evaluate the fine-grained abilities of Large Language Models (LLMs) in multi-turn dialogues. The benchmark is built from a detailed analysis of real multi-turn dialogue data and is organized as a three-tier hierarchical ability taxonomy covering 4208 turns across 1388 dialogues in 13 distinct tasks. The evaluation spans three overarching abilities—perceptivity, adaptability, and interactivity—and seven detailed abilities. The study assesses 21 popular LLMs, both closed-source and open-source, and finds that GPT-4 performs best. Key findings are that adaptability and interactivity remain the main deficiencies of existing LLMs, that performance improves with model size, and that common alignment techniques and chat-specific designs do not yield significant gains. The benchmark's effectiveness is validated through extensive case studies and human evaluation, which show high agreement between GPT-4 and human experts. The data and code are publicly available.
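To make the evaluation setup concrete, below is a minimal sketch (not the official MT-Bench-101 harness) of how per-turn, LLM-as-judge scoring of a multi-turn benchmark can be wired up: the model answers each turn with the full dialogue history, a judge scores every response, and scores are averaged per task before being rolled up into the ability hierarchy. All names here (`Turn`, `Dialogue`, `evaluate`, the toy model and judge) are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of per-turn, LLM-as-judge scoring for a multi-turn benchmark.
# All class/function names and score ranges are illustrative assumptions.

from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Turn:
    user: str        # user message for this turn
    reference: str   # reference answer, if the task provides one


@dataclass
class Dialogue:
    task: str        # one of the benchmark's 13 tasks, e.g. "context memory"
    turns: List[Turn]


def evaluate(dialogues: List[Dialogue],
             chat_model: Callable[[List[dict]], str],
             judge: Callable[[str, Turn], float]) -> Dict[str, float]:
    """Run the model turn by turn, score each reply with a judge model,
    and average the scores per task; the taxonomy then aggregates task
    scores into the seven detailed and three overarching abilities."""
    per_task: Dict[str, List[float]] = {}
    for dlg in dialogues:
        history: List[dict] = []
        for turn in dlg.turns:
            history.append({"role": "user", "content": turn.user})
            reply = chat_model(history)                 # model sees full history
            history.append({"role": "assistant", "content": reply})
            per_task.setdefault(dlg.task, []).append(judge(reply, turn))
    return {task: mean(scores) for task, scores in per_task.items()}


if __name__ == "__main__":
    # Toy stand-ins: an echo "model" and a length-based "judge" scoring 1-10.
    toy_model = lambda history: "Echo: " + history[-1]["content"]
    toy_judge = lambda reply, turn: min(10.0, 1.0 + len(reply) / 20)
    demo = [Dialogue("context memory", [Turn("My name is Ada.", ""),
                                        Turn("What is my name?", "Ada")])]
    print(evaluate(demo, toy_model, toy_judge))
```

In the paper's actual setup the judge role is played by GPT-4, whose scores are the ones compared against human experts; the toy judge above merely stands in for that call so the sketch runs offline.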