MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models


30 Jan 2024 | Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong
**Institution:** The Chinese University of Hong Kong, Huawei Noah's Ark Lab, The Hong Kong University of Science and Technology

**Abstract:** Large language models (LLMs) are increasingly used in complex multi-turn conversations across real-world applications. However, existing benchmarks focus primarily on single-turn evaluations, neglecting models' multi-turn interaction capabilities. To address this gap, the authors introduce MT-Eval, a comprehensive benchmark for evaluating LLMs' multi-turn conversational abilities. By analyzing human-LLM conversations, they categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. They construct multi-turn queries for each category, either by augmenting existing datasets or by creating new examples with GPT-4 to avoid data leakage. To study the factors affecting multi-turn ability, they also create single-turn versions of the 1,170 multi-turn queries and compare performance across the two settings. An evaluation of 11 well-known LLMs shows that closed-source models generally outperform open-source ones, although some open-source models match or exceed GPT-3.5-Turbo on specific tasks. The study observes significant performance degradation in multi-turn settings relative to single-turn settings, and this degradation is not correlated with the models' fundamental capabilities. Key factors influencing multi-turn performance include the distance to relevant content and susceptibility to error propagation. MT-Eval is released publicly to encourage future research on more robust conversational models.

**Contributions:**
- Propose a comprehensive benchmark for evaluating multi-turn conversational capabilities.
- Provide an in-depth analysis of 11 popular LLMs across the benchmark.
- Identify key factors affecting LLM multi-turn performance.
- Highlight the importance of evaluating LLMs in multi-turn settings.

**Related Work:** Recent advancements in LLMs have improved their ability to engage in human-like, multi-turn conversations, yet few studies focus specifically on multi-turn conversational capability. Previous work includes MT-Bench, which evaluates conversational flow and instruction-following, and HALIE, a framework for evaluating human-AI interaction. This work instead evaluates LLMs' multi-turn conversation abilities comprehensively, covering a range of real-world scenarios.

**MT-Eval:**
- Designed to evaluate LLMs' multi-turn conversation capabilities.
- Categorizes interaction patterns into four types: recollection, expansion, refinement, and follow-up.
- Constructs an evaluation set for each interaction type, using GPT-4 to generate new instances.
- Compares multi-turn and single-turn performance to measure the gap between the two settings (a minimal evaluation-loop sketch is given after the results below).

**Results:**
- Most models perform worse in multi-turn settings than in single-turn settings.
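To make the multi-turn vs. single-turn comparison concrete, here is a minimal sketch of an evaluation loop in that spirit. The dialogue data, the `query_model` stub, and the scoring step are illustrative assumptions, not the authors' released code; MT-Eval's actual tasks, prompts, and metrics live in the official benchmark.

```python
# Hedged sketch: comparing multi-turn (full history) against single-turn
# (self-contained) queries, in the spirit of MT-Eval. All names below are
# illustrative assumptions, not the paper's implementation.
from typing import Callable, Dict, List

# The four interaction patterns identified in the paper.
TASK_TYPES = ["recollection", "expansion", "refinement", "follow-up"]


def query_model(messages: List[Dict[str, str]]) -> str:
    """Placeholder for an LLM call (e.g., an API or a local model)."""
    return "model response"


def run_multi_turn(turns: List[str], model: Callable) -> List[str]:
    """Feed queries sequentially, keeping the full dialogue history."""
    history: List[Dict[str, str]] = []
    responses: List[str] = []
    for user_turn in turns:
        history.append({"role": "user", "content": user_turn})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        responses.append(reply)
    return responses


def run_single_turn(turns: List[str], model: Callable) -> List[str]:
    """Ask each query in isolation; MT-Eval builds self-contained variants
    of every multi-turn query so the two settings are directly comparable."""
    return [model([{"role": "user", "content": t}]) for t in turns]


if __name__ == "__main__":
    # Toy dialogue illustrating the "recollection" pattern: the final turn
    # depends on information introduced several turns earlier.
    dialogue = [
        "Summarize this article about solar panels in one sentence.",
        "Translate your summary into French.",
        "What was the topic of the article I gave you in the first turn?",
    ]
    multi = run_multi_turn(dialogue, query_model)
    single = run_single_turn(dialogue, query_model)
    # In the paper, each output is then scored (e.g., by GPT-4 or
    # task-specific metrics) and the multi-turn vs. single-turn gap reported.
    for turn, (m, s) in zip(dialogue, zip(multi, single)):
        print(f"Q: {turn}\n  multi-turn: {m}\n  single-turn: {s}")
```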