**MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models**
**Authors:** Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong
**Institutions:** The Chinese University of Hong Kong, Huawei Noah’s Ark Lab, The Hong Kong University of Science and Technology
**Abstract:**
Large language models (LLMs) are increasingly used in complex multi-turn conversations across various real-world applications. However, existing benchmarks primarily focus on single-turn evaluations, neglecting the models' multi-turn interaction capabilities. To address this gap, the authors introduce MT-Eval, a comprehensive benchmark designed to evaluate LLMs' multi-turn conversational abilities. By analyzing human-LLM conversations, they categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. They construct multi-turn queries for each category, either by augmenting existing datasets or creating new examples with GPT-4 to avoid data leakage. To study the factors affecting multi-turn abilities, they create single-turn versions of the 1170 multi-turn queries and compare performance. The evaluation of 11 well-known LLMs shows that closed-source models generally outperform open-source ones, but some open-source models match or exceed GPT-3.5-Turbo in specific tasks. The study observes significant performance degradation in multi-turn settings compared to single-turn settings, which is not correlated with the models' fundamental capabilities. Key factors influencing multi-turn performance include the distance to relevant content and susceptibility to error propagation. MT-Eval is released publicly to encourage future research on more robust conversational models.
**Contributions:**
- Propose a comprehensive multi-turn conversational capabilities evaluation benchmark.
- Provide an in-depth analysis of 11 popular LLMs across the benchmark.
- Identify key factors affecting LLM multi-turn performance.
- Highlight the importance of evaluating LLMs in multi-turn settings.
**Related Work:**
Recent advances in LLMs have improved their ability to engage in human-like, multi-turn conversations, yet relatively few studies have focused on evaluating these multi-turn capabilities. Prior work includes MT-Bench, which evaluates conversational flow and instruction-following ability, and HALIE, a framework for evaluating human-AI interaction. In contrast, this work evaluates LLMs' comprehensive multi-turn conversation abilities, covering a range of real-world scenarios.
**MT-Eval:**
- Designed to evaluate LLMs' multi-turn conversation capabilities.
- Categorizes interaction patterns into four types: recollection, expansion, refinement, and follow-up.
- Constructs evaluation sets for each interaction type, either by augmenting existing datasets or by generating new instances with GPT-4 to avoid data leakage.
- Compares multi-turn and single-turn performance to measure the gap between them (see the sketch after this list).
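A minimal sketch of how such paired multi-turn and single-turn evaluations might be driven is shown below. The `DialogueInstance` fields, the `TASK_TYPES` tuple, and the `chat` callable are illustrative assumptions for exposition, not MT-Eval's actual data schema or evaluation code.

```python
from dataclasses import dataclass

# The four interaction patterns covered by MT-Eval.
TASK_TYPES = ("recollection", "expansion", "refinement", "follow-up")

@dataclass
class DialogueInstance:
    """Hypothetical container for one benchmark dialogue (not the real schema)."""
    task_type: str            # one of TASK_TYPES
    turns: list               # multi-turn user queries, in order
    single_turn_queries: list # self-contained rewrites of the same queries

def run_multi_turn(instance, chat):
    """Query the model turn by turn, carrying the accumulated dialogue history."""
    history, outputs = [], []
    for query in instance.turns:
        history.append({"role": "user", "content": query})
        reply = chat(history)  # `chat`: assumed callable mapping a message list to reply text
        history.append({"role": "assistant", "content": reply})
        outputs.append(reply)
    return outputs

def run_single_turn(instance, chat):
    """Ablation: each rewritten query already carries the context it needs,
    so it is sent without any dialogue history."""
    return [chat([{"role": "user", "content": q}]) for q in instance.single_turn_queries]
```

Per-turn outputs from both settings would then be scored and compared to quantify the multi-turn gap discussed in the Results section.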
**Results:**
- Most models perform worse in multi-turn settings compared to single-turn settings.
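As a rough illustration of how that degradation could be summarized, the helper below computes the relative drop from single-turn to multi-turn scores per task. The metric and the example numbers are assumptions for exposition, not the paper's reported evaluation protocol or results.

```python
def relative_degradation(single_turn, multi_turn):
    """Relative drop from single-turn to multi-turn scores, per task,
    expressed as a fraction of the single-turn score."""
    return {
        task: (single_turn[task] - multi_turn[task]) / single_turn[task]
        for task in single_turn
    }

# Illustration with made-up scores (not the paper's numbers):
# relative_degradation({"refinement": 80.0}, {"refinement": 68.0})
# -> {"refinement": 0.15}, i.e. a 15% relative drop in the multi-turn setting
```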