RETRIEVAL AUGMENTED END-TO-END SPOKEN DIALOG MODELS

2 Feb 2024 | Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey
This paper introduces the Retrieval-Augmented Speech Understanding Model (ReSLM), which integrates a speech retriever with a joint speech and language model (SLM) to improve spoken dialog performance. SLM combines a pretrained speech model with a large language model (LLM), enabling in-context learning. ReSLM addresses the challenge of identifying domain-specific entities in speech dialog systems by retrieving text entities directly from the audio signal: a speech retriever based on a dual-encoder architecture retrieves relevant text entities from the audio input, and these entities are then added as text inputs to the SLM to improve its predictions.

The paper evaluates ReSLM on the DSTC-11 dialog state tracking task, which involves tracking dialog states in spoken dialogues. ReSLM outperforms existing systems, achieving higher joint goal accuracy (38.6% vs. 32.7%), a lower slot error rate (20.6% vs. 24.8%), and a lower ASR word error rate (5.5% vs. 6.7%). Ablation studies further validate the model, showing significant improvements in recognizing specific entities such as hotel names, restaurant names, and train stations.

ReSLM's approach is not limited to dialog state tracking; it can be applied to other speech tasks that require contextual information or domain-specific entities, such as contextual ASR with biasing capability. The paper also discusses the limitations of traditional speech dialog systems, which rely on a cascaded pipeline of ASR followed by NLU, and highlights the benefits of retrieval-augmented models in improving accuracy and reducing errors. The study demonstrates that integrating contextual information through retrieval significantly enhances the performance of speech dialog systems, particularly in handling rare domain-specific entities.
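The dual-encoder retrieval step can be sketched as a nearest-neighbor search: the audio encoder and the entity text encoder map their inputs into a shared embedding space, and the top-scoring entity strings are appended to the SLM's text input. The snippet below is a minimal illustration only; the toy embeddings, entity names, and prompt format are hypothetical stand-ins, not the paper's actual encoders or inputs.

```python
import numpy as np

def retrieve_entities(audio_embedding, entity_embeddings, entity_names, k=3):
    """Rank candidate entities by cosine similarity between the audio
    embedding and each entity's text embedding; return the top-k names."""
    a = audio_embedding / np.linalg.norm(audio_embedding)
    e = entity_embeddings / np.linalg.norm(entity_embeddings, axis=1, keepdims=True)
    scores = e @ a                      # cosine similarity per entity
    top = np.argsort(-scores)[:k]      # indices of the k highest scores
    return [entity_names[i] for i in top]

def augment_input(text_input, retrieved):
    # Retrieved entities are supplied as additional text inputs to the SLM;
    # the "[entities]" delimiter here is an illustrative choice, not the
    # paper's actual formatting.
    return text_input + " [entities] " + "; ".join(retrieved)

# Toy example: a 2-d "audio" embedding and three candidate entities.
audio_emb = np.array([1.0, 0.0])
entity_embs = np.array([[1.0, 0.0],    # closest to the audio embedding
                        [0.0, 1.0],
                        [0.7, 0.7]])
names = ["Hotel Alpha", "North Station", "Cafe Bravo"]
hits = retrieve_entities(audio_emb, entity_embs, names, k=2)
```

In this toy setup `hits` contains the two entities whose embeddings point most nearly in the same direction as the audio embedding, and `augment_input` shows how they would be concatenated onto the model's text input.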