6 Apr 2016 | Iulian V. Serban*, Alessandro Sordoni*, Yoshua Bengio*, Aaron Courville* and Joelle Pineau†
This paper presents a method for building end-to-end dialogue systems using generative hierarchical neural network models. The authors propose the hierarchical recurrent encoder-decoder (HRED) model, which is competitive with state-of-the-art neural language models and back-off n-gram models. They extend the model to the dialogue domain and demonstrate that it performs well on the MovieTriples dataset, which is derived from movie scripts. Training is bootstrapped from pretrained word embeddings and from a larger question-answer pair corpus. The authors also introduce a bidirectional HRED variant that improves performance by capturing more context from each utterance.

The model is evaluated using word perplexity and word classification error; the results show that bootstrapping from the SubTle corpus significantly improves performance. The authors also discuss the model's limitations, including the difficulty of learning topic-specific embeddings and the potential bias of metrics based on MAP outputs. The paper concludes that further research is needed to improve dialogue systems, both by exploring neural architectures that explicitly separate semantic and syntactic structure and by incorporating additional context and modalities such as audio and video.
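To make the hierarchy concrete, below is a minimal sketch of the HRED structure in PyTorch: an utterance-level encoder RNN, a dialogue-level context RNN that reads one utterance summary per step, and a decoder conditioned on the context state. This is an illustrative reconstruction, not the authors' code; the class name, hyperparameters, and the choice of GRU cells are assumptions.

```python
import torch
import torch.nn as nn

class HRED(nn.Module):
    """Minimal hierarchical recurrent encoder-decoder sketch.

    Three RNNs: an utterance-level encoder, a dialogue-level context
    RNN, and a decoder that conditions each response token on the
    dialogue context state. Dimensions are illustrative.
    """

    def __init__(self, vocab_size, emb_dim=300, hid_dim=512, ctx_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Utterance encoder: reads one utterance, token by token.
        self.utt_encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # Context RNN: reads one utterance summary per dialogue turn.
        self.ctx_rnn = nn.GRU(hid_dim, ctx_dim, batch_first=True)
        # Decoder: generates the next utterance given the context state.
        self.decoder = nn.GRU(emb_dim + ctx_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, utterances, response):
        # utterances: (batch, n_utts, utt_len) token ids of the context turns
        # response:   (batch, resp_len)        token ids of the target turn
        batch, n_utts, utt_len = utterances.shape
        # Encode each utterance independently with the shared encoder.
        flat = utterances.view(batch * n_utts, utt_len)
        _, h = self.utt_encoder(self.embed(flat))          # h: (1, B*N, hid)
        utt_vecs = h.squeeze(0).view(batch, n_utts, -1)    # (B, N, hid)
        # Summarize the dialogue so far with the context RNN.
        _, ctx = self.ctx_rnn(utt_vecs)                    # ctx: (1, B, ctx)
        # Condition every decoder step on the final context state.
        ctx_step = ctx.squeeze(0).unsqueeze(1).expand(-1, response.size(1), -1)
        dec_in = torch.cat([self.embed(response), ctx_step], dim=-1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)                           # (B, T, vocab) logits
```

The bidirectional variant mentioned above would replace the forward-only utterance encoder with a bidirectional one (e.g. `nn.GRU(..., bidirectional=True)`) and concatenate or combine the two final states, so each utterance summary reflects both left and right context. The bootstrapping step would amount to initializing `self.embed` with pretrained word vectors before training on dialogue data.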
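For the evaluation metric, word perplexity is the exponential of the average per-token negative log-likelihood under the model. A short sketch, assuming logits shaped like the output of the model above and a hypothetical padding id:

```python
import torch
import torch.nn.functional as F

def word_perplexity(logits, targets, pad_id=0):
    """exp(mean per-token NLL), ignoring padding positions.

    logits:  (batch, seq_len, vocab) unnormalized scores
    targets: (batch, seq_len)        gold token ids
    """
    nll = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=pad_id,
        reduction="mean",
    )
    return torch.exp(nll)
```

Lower perplexity means the model assigns higher probability to the reference responses; the paper's finding is that SubTle pretraining lowers this number substantially on MovieTriples.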