SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words


19 Jun 2024 | Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu
SD-Eval is a benchmark dataset for multidimensional evaluation of spoken dialogue understanding and generation. It comprises 7,303 utterances totaling 8.76 hours of speech, aggregated from eight public datasets, and targets paralinguistic and environmental information from four perspectives: emotion, accent, age, and background sound. SD-Eval aims to promote the development of more empathetic and intelligent spoken dialogue systems that generate appropriate responses based not only on the words spoken but also on paralinguistic and environmental cues.

The benchmark contains four subsets, test-emo, test-acc, test-age, and test-env, each focusing on one of these aspects of speech. To assess SD-Eval, the authors implemented three different models and constructed a training set following a similar pipeline; it contains 1,052.72 hours of speech across 724.4k utterances. Generated responses were evaluated with objective metrics (e.g., BLEU and ROUGE), subjective human evaluation, and LLM-based metrics. Models conditioned on paralinguistic and environmental information outperformed their unconditioned counterparts on both objective and subjective measures, and the LLM-based metrics correlated more strongly with human evaluation than the traditional metrics did. Hedged sketches of both scoring approaches follow.
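The objective metrics are standard text-overlap scores between a generated response and the reference responses. A minimal sketch, assuming the sacrebleu and rouge_score packages and multi-reference scoring; SD-Eval's exact metric configuration may differ:

```python
# Illustrative scoring sketch, not SD-Eval's exact setup: compare one
# generated response against several reference responses with BLEU and
# ROUGE-L.
import sacrebleu
from rouge_score import rouge_scorer

def score_response(hypothesis: str, references: list[str]) -> dict:
    # sacrebleu accepts multiple references for a single hypothesis.
    bleu = sacrebleu.sentence_bleu(hypothesis, references).score
    # rouge_score compares one pair at a time; take the best reference.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = max(scorer.score(ref, hypothesis)["rougeL"].fmeasure
                  for ref in references)
    return {"bleu": bleu, "rougeL": rouge_l}

print(score_response(
    "I'm sorry to hear that. Do you want to talk about it?",
    ["Oh no, I'm sorry to hear that. What happened?",
     "That sounds hard. Want to tell me more?"]))
```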
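The LLM-based metric asks a strong LLM to rate each response given the utterance and its paralinguistic label, and its agreement with human evaluation can then be checked by rank correlation. A hedged sketch; the judging prompt, rating scale, and the illustrative scores below are assumptions, not the paper's protocol:

```python
# Hypothetical LLM-as-judge sketch. The prompt wording, 1-5 scale, and the
# example score lists are assumptions for illustration only.
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are evaluating a spoken-dialogue response.\n"
    "User utterance: {utterance}\n"
    "Speaker emotion: {emotion}\n"
    "Model response: {response}\n"
    "Rate from 1 to 5 how appropriate the response is given both the "
    "content and the emotion. Answer with a single digit."
)

def judge(utterance: str, emotion: str, response: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            utterance=utterance, emotion=emotion, response=response)}],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip()[0])

# Agreement with human evaluation: rank-correlate judge scores with human
# scores over the same responses (illustrative numbers only).
llm_scores = [4, 2, 5, 3, 1]
human_scores = [5, 2, 4, 3, 1]
rho, p = spearmanr(llm_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```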
The dataset was constructed by selecting data from the eight source datasets, synthesizing data for certain subsets, and normalizing labels across sources. Data filtering was performed to ensure quality, and punctuation restoration was applied to the transcripts. Reference responses were produced with GPT-4o, which was prompted to generate diverse responses for each utterance conditioned on its content and its other speech characteristics; a hedged sketch of this step appears below.

In the experiments, VS-LLM outperformed the Cascade LLM on all metrics, indicating that taking speech directly as input allows VS-LLM to implicitly learn paralinguistic and environmental information, which a cascade discards at the transcription step (see the cascade sketch below). VS-LLM's performance was still inferior to the LLM upper bound, suggesting that how the input is processed is crucial to model performance. Qwen-Audio, an off-the-shelf speech LLM that performs well on many tasks, was unimpressive on SD-Eval, highlighting the need for well-defined tasks and datasets in this area. Further analysis showed that models given ground-truth transcripts and labels outperformed those given ASR-generated transcripts or emotion labels predicted by SER models, and, as noted above, that LLM-based metrics correlated more strongly with human evaluation than traditional metrics.

SD-Eval is intended to advance spoken dialogue systems that generate appropriate responses by considering paralinguistic and environmental information. Its current limitations are that it covers only speech-to-text dialogues and only single-turn exchanges; future work includes addressing these limitations and building a benchmark for multi-turn speech-to-speech dialogues.
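The response-generation step can be pictured as a prompt to GPT-4o that includes both the transcript and the normalized label. A minimal sketch, assuming the openai Python client; the prompt wording and the generate_references helper are hypothetical, not the authors' actual prompt:

```python
# Hypothetical sketch of reference-response generation. The paper reports
# using GPT-4o conditioned on content plus speech characteristics; the
# prompt below is an assumption for illustration.
from openai import OpenAI

client = OpenAI()

def generate_references(transcript: str, attribute: str, value: str,
                        n: int = 3) -> list[str]:
    prompt = (
        f'A speaker says: "{transcript}"\n'
        f"The speaker's {attribute} is: {value}.\n"
        f"Write {n} short, distinct replies that respond to the content "
        f"and are appropriate for the {attribute}."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    # Naive parsing: one reply per non-empty line of the model output.
    text = completion.choices[0].message.content
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

refs = generate_references("I failed my driving test again.", "emotion", "sad")
print(refs)
```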
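For contrast, a cascade baseline transcribes the audio first and responds from text alone, so emotion, accent, age, and background sound are lost before the LLM ever sees the input. A minimal sketch, assuming openai-whisper for ASR and the openai client for the responder; the paper's Cascade LLM may use different components:

```python
# Minimal cascade sketch: ASR, then a text-only LLM. Model choices
# (Whisper "base", gpt-4o) are assumptions for illustration. Paralinguistic
# cues are discarded at the transcription step, which is the weakness the
# VS-LLM comparison highlights.
import whisper
from openai import OpenAI

asr = whisper.load_model("base")
client = OpenAI()

def cascade_response(wav_path: str) -> str:
    # Text only: emotion, accent, age, and background sound are dropped here.
    transcript = asr.transcribe(wav_path)["text"]
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Reply to the speaker: {transcript}"}],
    )
    return completion.choices[0].message.content

print(cascade_response("example.wav"))
```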