RETRIEVAL AUGMENTED END-TO-END SPOKEN DIALOG MODELS

2 Feb 2024 | Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey
This paper introduces the Retrieval-Augmented Speech Understanding Model (ReSLM), which integrates a speech retriever with a joint speech and language model (SLM) to improve spoken dialog performance. SLM combines a pretrained speech model with a large language model (LLM), enabling in-context learning. ReSLM addresses the challenge of identifying domain-specific entities in speech dialog systems by retrieving text entities directly from the audio signal: a speech retriever based on a dual-encoder architecture retrieves relevant text entities from the audio input, and these entities are then added as text inputs to the SLM to improve its predictions.

The paper evaluates ReSLM on the DSTC-11 dialog state tracking task, which involves tracking dialog states in spoken dialogues. ReSLM outperforms existing systems, achieving higher joint goal accuracy (38.6% vs. 32.7%), a lower slot error rate (20.6% vs. 24.8%), and a lower ASR word error rate (5.5% vs. 6.7%). Ablation studies further validate the model, showing significant improvements in recognizing specific entities such as hotel names, restaurant names, and train stations.

ReSLM's approach is not limited to dialog state tracking; it can be applied to other speech tasks that require contextual information or domain-specific entities, such as contextual ASR with biasing capability. The paper also discusses the limitations of traditional speech dialog systems, which rely on a cascaded pipeline of ASR followed by NLU, and highlights the benefits of retrieval-augmented models in improving accuracy and reducing errors. The study demonstrates that integrating contextual information through retrieval significantly enhances the performance of speech dialog systems, particularly in handling rare domain-specific entities.
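The dual-encoder retrieval step can be sketched as a nearest-neighbor search: the audio encoder and the entity text encoder map their inputs into a shared embedding space, and the top-scoring entity strings are appended to the SLM's text input. The snippet below is a minimal illustration only; the toy embeddings, entity names, and prompt format are hypothetical stand-ins, not the paper's actual encoders or inputs.

```python
import numpy as np

def retrieve_entities(audio_embedding, entity_embeddings, entity_names, k=3):
    """Rank candidate entities by cosine similarity between the audio
    embedding and each entity's text embedding; return the top-k names."""
    a = audio_embedding / np.linalg.norm(audio_embedding)
    e = entity_embeddings / np.linalg.norm(entity_embeddings, axis=1, keepdims=True)
    scores = e @ a                      # cosine similarity per entity
    top = np.argsort(-scores)[:k]      # indices of the k highest scores
    return [entity_names[i] for i in top]

def augment_input(text_input, retrieved):
    # Retrieved entities are supplied as additional text inputs to the SLM;
    # the "[entities]" delimiter here is an illustrative choice, not the
    # paper's actual formatting.
    return text_input + " [entities] " + "; ".join(retrieved)

# Toy example: a 2-d "audio" embedding and three candidate entities.
audio_emb = np.array([1.0, 0.0])
entity_embs = np.array([[1.0, 0.0],    # closest to the audio embedding
                        [0.0, 1.0],
                        [0.7, 0.7]])
names = ["Hotel Alpha", "North Station", "Cafe Bravo"]
hits = retrieve_entities(audio_emb, entity_embs, names, k=2)
```

In this toy setup `hits` contains the two entities whose embeddings point most nearly in the same direction as the audio embedding, and `augment_input` shows how they would be concatenated onto the model's text input.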