2017 | Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio
This paper introduces a hierarchical latent variable encoder-decoder model (VHRED) for generating dialogues. The model addresses the limitations of previous neural network architectures in generating meaningful, long, and diverse dialogue responses. The key innovation is a hierarchical structure with stochastic latent variables at the utterance level, each spanning many time steps of the dialogue. The model is trained by maximizing a variational lower bound on the log-likelihood, allowing it to capture complex dependencies between sub-sequences in dialogues.
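Roughly in the paper's notation, the variational lower bound decomposes over the N utterances of a dialogue: a reconstruction term for each utterance under the approximate posterior, minus a KL penalty keeping that posterior close to the context-conditioned prior.

```latex
\log P_\theta(\mathbf{w}_1, \ldots, \mathbf{w}_N) \geq
\sum_{n=1}^{N}
\mathbb{E}_{q_\psi(\mathbf{z}_n \mid \mathbf{w}_1, \ldots, \mathbf{w}_n)}
\big[ \log P_\theta(\mathbf{w}_n \mid \mathbf{z}_n, \mathbf{w}_1, \ldots, \mathbf{w}_{n-1}) \big]
- \mathrm{KL}\big( q_\psi(\mathbf{z}_n \mid \mathbf{w}_1, \ldots, \mathbf{w}_n)
\,\|\, P_\theta(\mathbf{z}_n \mid \mathbf{w}_1, \ldots, \mathbf{w}_{n-1}) \big)
```

Here $\mathbf{w}_n$ is the n-th utterance and $\mathbf{z}_n$ its latent variable; both the prior and the approximate posterior are Gaussians whose parameters are produced by networks reading the context state.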
VHRED consists of three main components: an encoder RNN, a context RNN, and a decoder RNN. The encoder RNN encodes each utterance into a real-valued vector, which the context RNN consumes to compute a hidden state summarizing all previous utterances. For each response, a latent variable is sampled from a prior distribution conditioned on the context state (during training, from an approximate posterior that also observes the response). The decoder RNN then generates the response word by word, conditioned on both the context state and the sampled latent variable.
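The three-level generation process can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: it uses plain tanh RNN cells in place of the gated cells the authors use, random stand-in word embeddings, and made-up dimensions, but the data flow — encoder per utterance, context RNN over utterance vectors, latent sample from a context-conditioned prior, decoder conditioned on context and latent — follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the paper's hyperparameters).
EMB, HID, LAT = 8, 16, 4

def rnn_step(x, h, Wx, Wh, b):
    """One step of a plain tanh RNN cell (stand-in for the paper's gated cells)."""
    return np.tanh(Wx @ x + Wh @ h + b)

def init_cell(d_in, d_out):
    return (0.1 * rng.standard_normal((d_out, d_in)),
            0.1 * rng.standard_normal((d_out, d_out)),
            np.zeros(d_out))

enc_p = init_cell(EMB, HID)         # encoder RNN
ctx_p = init_cell(HID, HID)         # context RNN
dec_p = init_cell(EMB + LAT, HID)   # decoder RNN (word embedding + latent as input)

# Prior network: maps the context state to (mean, log-variance) of the latent z.
W_mu = 0.1 * rng.standard_normal((LAT, HID))
W_lv = 0.1 * rng.standard_normal((LAT, HID))

def encode_utterance(tokens, params):
    """Run the encoder RNN over one utterance; return its final state."""
    Wx, Wh, b = params
    h = np.zeros(HID)
    for x in tokens:
        h = rnn_step(x, h, Wx, Wh, b)
    return h

def vhred_respond(dialogue, n_words=3):
    """dialogue: list of utterances, each a list of word-embedding vectors."""
    # 1. Encoder RNN: one real-valued vector per utterance.
    utt_vecs = [encode_utterance(u, enc_p) for u in dialogue]
    # 2. Context RNN: hidden state summarizing all previous utterances.
    Wx, Wh, b = ctx_p
    c = np.zeros(HID)
    for v in utt_vecs:
        c = rnn_step(v, c, Wx, Wh, b)
    # 3. Sample the latent variable from the prior conditioned on the context
    #    (reparameterization: z = mu + sigma * eps).
    mu, log_var = W_mu @ c, W_lv @ c
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(LAT)
    # 4. Decoder RNN: generate word states conditioned on the context state
    #    (initial hidden state) and z (appended to every input). Random
    #    vectors stand in for the embeddings of sampled words.
    Wx, Wh, b = dec_p
    h, states = c.copy(), []
    for _ in range(n_words):
        x = np.concatenate([rng.standard_normal(EMB), z])
        h = rnn_step(x, h, Wx, Wh, b)
        states.append(h)
    return np.array(states)

dialogue = [[rng.standard_normal(EMB) for _ in range(5)] for _ in range(2)]
out = vhred_respond(dialogue)
print(out.shape)  # (3, 16): one hidden state per generated word
```

In a trained model the decoder states would be projected to vocabulary logits and the sampled words fed back in; the sketch stops at the hidden states to keep the hierarchy itself in focus.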
The model is evaluated on a Twitter Dialogue Corpus against LSTM and HRED baselines. Human evaluators prefer VHRED's responses significantly more often, particularly for long contexts: the responses are more semantically rich and diverse, and they better maintain the dialogue context. Notably, the model sometimes generates responses in other languages, even though the training data was filtered to include only English tweets.
The paper also discusses the limitations of previous models, such as the shallow generation process that limits their ability to model high-level variability. The hierarchical structure of VHRED allows it to capture the complex dependencies between sub-sequences in dialogues, leading to more meaningful and diverse responses. The model is also able to generate responses that are more aligned with the context, which is crucial for maintaining engaging conversations. The results of the human evaluation study further support the effectiveness of the model in generating high-quality dialogue responses.