SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

19 Jun 2024

**Authors:** Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

**Affiliations:** The Chinese University of Hong Kong, Shenzhen; ByteDance

**Abstract:** Speech conveys rich information beyond words, including content, paralinguistic, and environmental aspects, all of which shape communication and human-computer interaction. While large language models (LLMs) have evolved to handle multi-modal inputs, they often fail to generate appropriate responses, in part because principles for task definition and model development are lacking. To address this, the authors present SD-Eval, a benchmark dataset designed for multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and comprises 7,303 utterances totaling 8.76 hours of speech. The data is aggregated from eight public datasets and covers four perspectives: emotion, accent, age, and background sound. To evaluate on SD-Eval, three models are implemented and a training set is constructed following a similar process. The evaluation combines objective methods (BLEU, ROUGE), subjective assessments, and LLM-based metrics. Results show that models conditioned on paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. In addition, LLM-based metrics correlate more strongly with human evaluation than traditional metrics.
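
To make the evaluation setup concrete, below is a minimal sketch of reference-based scoring plus a correlation check between an LLM-based metric and human ratings. The library choices (sacrebleu, rouge-score, scipy) and all data values are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal sketch of the reference-based scoring and the LLM-metric-vs-human
# correlation check described above. The libraries (sacrebleu, rouge-score,
# scipy) and all data values are illustrative assumptions, not the paper's
# actual evaluation pipeline.
import sacrebleu                      # pip install sacrebleu
from rouge_score import rouge_scorer  # pip install rouge-score
from scipy.stats import spearmanr     # pip install scipy

# Hypothetical model responses and reference responses (one per test utterance).
hypotheses = [
    "I'm sorry you had a rough day. Would you like to talk about it?",
    "That traffic noise sounds stressful. Maybe try a quieter route home.",
]
references = [
    "I'm sorry to hear you're upset. Do you want to tell me what happened?",
    "It sounds noisy where you are. Perhaps a calmer route would help.",
]

# Corpus-level BLEU; sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# Sentence-level ROUGE-L F1, averaged over the test set.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(
    scorer.score(ref, hyp)["rougeL"].fmeasure
    for ref, hyp in zip(references, hypotheses)
) / len(hypotheses)
print(f"ROUGE-L (F1): {rouge_l:.3f}")

# How well does an LLM-based metric track human judgment? Correlate
# per-response scores from an LLM judge with mean human ratings
# (both lists are made-up numbers for a hypothetical batch of responses).
llm_judge_scores = [4.5, 3.0, 2.5, 5.0, 3.5]
human_ratings = [4.0, 3.5, 2.0, 4.5, 3.0]
rho, p_value = spearmanr(llm_judge_scores, human_ratings)
print(f"Spearman rho (LLM metric vs. human): {rho:.2f} (p = {p_value:.3f})")
```

Corpus-level BLEU and averaged sentence-level ROUGE-L mirror common practice for reference-based text metrics; a rank correlation such as Spearman's rho is one simple way to quantify the claim that LLM-based metrics align more closely with human evaluation.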
**Key Contributions:**

- **SD-Eval Dataset:** A comprehensive benchmark dataset for evaluating spoken dialogue systems, focusing on paralinguistic and environmental information.
- **Model Evaluation:** Implementation of three models and a detailed evaluation process covering objective and subjective metrics.
- **LLM-Based Metrics:** Demonstration that LLM-based metrics correlate more strongly with human evaluation, highlighting their effectiveness for assessing spoken dialogue generation.

**Conclusion:** SD-Eval aims to advance the creation of more empathetic and intelligent spoken dialogue systems by providing a robust benchmark for evaluating models on paralinguistic and environmental information. Future work will expand the dataset to cover more aspects and improve multi-turn dialogue evaluation.