MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models

30 Jan 2024 | Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong
MT-Eval is a benchmark designed to evaluate the multi-turn conversational abilities of large language models (LLMs). It categorizes interaction patterns into four types: recollection, expansion, refinement, and follow-up. By analyzing human-LLM conversations, the researchers constructed multi-turn queries for each category, either by augmenting existing datasets or by creating new examples with GPT-4. They also created single-turn versions of the 1,170 multi-turn queries to study the factors that affect multi-turn abilities. An evaluation of 11 well-known LLMs showed that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo on specific tasks. Most models degrade significantly in multi-turn settings compared to single-turn settings, and this gap is not correlated with the models' fundamental capabilities. The researchers identified the distance to relevant content and susceptibility to error propagation as the key factors influencing multi-turn performance. MT-Eval is released publicly to encourage future research towards more robust conversational models.

The benchmark includes test sets targeting the four conversation categories while mirroring everyday scenarios such as document processing, content creation, and information retrieval. It comprises 168 dialogue sessions with 1,170 turns to assess models' competence in handling realistic multi-turn interactions. For each interaction type, the researchers built evaluation sets by augmenting existing datasets or creating new ones that cover real-world applications. New instances were generated with GPT-4 to avoid data contamination and were manually reviewed and revised for quality assurance.
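The following is a minimal sketch of how a session from such a benchmark could be scored in both the multi-turn and single-turn settings. The JSON session layout, the `generate(messages)` chat interface, and the `scorer` callback are all illustrative assumptions, not the released MT-Eval format or code:

```python
import json

def generate(messages):
    """Placeholder for a chat model call (an API or a local LLM).

    `messages` is a list of {"role": ..., "content": ...} dicts; returns a string.
    """
    raise NotImplementedError

def evaluate_multi_turn(session_path, scorer):
    """Run one dialogue session turn by turn, keeping the model's own replies.

    Assumed (hypothetical) session format:
    {"task_type": "refinement",
     "turns": [{"user": "...", "reference": "..."}, ...]}
    """
    with open(session_path) as f:
        session = json.load(f)

    history, scores = [], []
    for turn in session["turns"]:
        history.append({"role": "user", "content": turn["user"]})
        reply = generate(history)
        history.append({"role": "assistant", "content": reply})
        scores.append(scorer(reply, turn["reference"]))
    return sum(scores) / len(scores)

def evaluate_single_turn(session_path, scorer):
    """Single-turn counterpart: each query is answered without the preceding
    dialogue. (In the paper the single-turn versions are rewritten to be
    self-contained; simply dropping the history here is only illustrative.)
    """
    with open(session_path) as f:
        session = json.load(f)

    scores = []
    for turn in session["turns"]:
        reply = generate([{"role": "user", "content": turn["user"]}])
        scores.append(scorer(reply, turn["reference"]))
    return sum(scores) / len(scores)
```

Keeping the model's own replies in `history` is what exposes error propagation; replacing them with reference answers corresponds to the gold-context condition discussed in the ablations below.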
The researchers evaluated 11 popular LLMs, including both open-source and closed-source models. Their findings include: 1) GPT-4 still dominates in multi-turn conversational ability, but some open-source models match GPT-3.5-Turbo on certain tasks. 2) Most LLMs perform worse in the multi-turn setting than in the single-turn setting, and the gap between the two is not related to a model's fundamental capabilities. 3) Increasing distance to the relevant content negatively impacts performance. 4) Models are prone to error propagation because of their sensitivity to the dialogue history. These results highlight the importance of evaluating LLMs in multi-turn settings, since performance discrepancies arise that single-turn evaluations do not reveal, and the benchmark provides a comprehensive view of models' multi-turn conversational capabilities.

The researchers also conducted ablation studies to investigate how varying the dialogue context affects model performance. Models conditioned on the gold context showed significant improvement in the Recollection and Refinement tasks, and performance dropped as the distance between the relevant context and the current query grew. They concluded that the distance to relevant content and susceptibility to error propagation are the key factors behind the decline in multi-turn performance, and they believe MT-Eval will encourage future research towards more robust conversational models.
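A small sketch of how the two ablations above could be summarized from per-turn evaluation records. The record fields (`task`, `context`, `distance`, `score`) and the scoring scale are assumptions for illustration, not the paper's actual analysis code:

```python
from collections import defaultdict

def ablation_report(results):
    """Summarize gold-context and distance ablations from per-turn records.

    Each record is an illustrative dict such as:
    {"task": "recollection",
     "context": "gold" or "self",   # gold history vs. the model's own replies
     "distance": 3,                 # turns between relevant content and the query
     "score": 1.0}
    """
    # 1) Gold context vs. the model's own (possibly erroneous) history.
    by_context = defaultdict(list)
    for r in results:
        by_context[(r["task"], r["context"])].append(r["score"])
    for (task, context), scores in sorted(by_context.items()):
        print(f"{task:12s} {context:4s} context: {sum(scores) / len(scores):.3f}")

    # 2) Performance as a function of distance to the relevant content.
    by_distance = defaultdict(list)
    for r in results:
        by_distance[r["distance"]].append(r["score"])
    for distance in sorted(by_distance):
        scores = by_distance[distance]
        print(f"distance {distance:2d}: {sum(scores) / len(scores):.3f}")
```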