April 19-23, 2021 | Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston
This paper presents recipes for building open-domain chatbots that perform well in human evaluations. The authors highlight that while scaling neural models in terms of parameters and training data improves results, other factors are also crucial. These include blending conversational skills such as providing engaging talking points, displaying knowledge, empathy, and personality, while maintaining a consistent persona. They show that large-scale models can learn these skills when given appropriate training data and generation strategies. The authors build variants of these recipes with 90M, 2.7B, and 9.4B parameter models and make their models and code publicly available. Human evaluations show their best models outperform existing approaches in multi-turn dialogue on engagingness and humanness measurements.
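To make the generation-strategy recipe more concrete, the sketch below runs beam search with a minimum response length, the decoding choice the paper credits with more engaging replies. It assumes the Hugging Face `transformers` port of a BlenderBot checkpoint; the checkpoint name and hyperparameter values here are illustrative assumptions, not the exact configuration released with the paper.

```python
# Minimal sketch of length-constrained beam search decoding, assuming the
# Hugging Face `transformers` port of a BlenderBot checkpoint.
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

# Illustrative checkpoint; the paper's own 90M/2.7B/9.4B models are released via ParlAI.
checkpoint = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(checkpoint)
model = BlenderbotForConditionalGeneration.from_pretrained(checkpoint)

context = "Hello, how are you doing today?"
inputs = tokenizer(context, return_tensors="pt")

# Beam search with a minimum generated length, mirroring the paper's finding
# that forcing longer replies improves human engagingness ratings.
reply_ids = model.generate(
    **inputs,
    num_beams=10,
    min_length=20,   # assumed minimum-length setting for illustration
    max_length=128,
)
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])
```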
The authors discuss the limitations of their work by analyzing failure cases of their models. They note that the models still display a lack of in-depth knowledge when sufficiently interrogated, a tendency to stick to simpler language, and a tendency to repeat oft-used phrases. They suggest unlikelihood training and retrieve-and-refine mechanisms as potential avenues for addressing these problems, though their initial experiments with these methods are inconclusive. They also discuss future possibilities for alleviating these problems and methods for evaluating them.
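To make the unlikelihood-training idea concrete, here is a minimal PyTorch sketch of a token-level unlikelihood term that penalizes probability mass assigned to a set of "negative" tokens (for instance, n-grams the model already overuses). The function name and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, negative_mask):
    """Token-level unlikelihood term (sketch).

    logits:        (seq_len, vocab_size) decoder scores at each step.
    negative_mask: (seq_len, vocab_size) boolean mask marking tokens to
                   penalize at each step, e.g. already-generated n-grams.
    """
    probs = F.softmax(logits, dim=-1)
    # Penalize probability placed on negative candidates: -log(1 - p(token)),
    # clamped for numerical stability when p(token) is close to 1.
    one_minus_p = torch.clamp(1.0 - probs, min=1e-20)
    per_step = -(torch.log(one_minus_p) * negative_mask.float()).sum(dim=-1)
    return per_step.mean()
```

In training, this term would typically be added to the usual maximum-likelihood loss with a weighting coefficient, pushing the model away from over-represented phrases without changing the rest of the objective.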
The authors also present a detailed evaluation of their models using various metrics, including perplexity, safety, and human evaluations. They compare their models to existing chatbots such as DialoGPT and Meena, showing that their best models outperform these in terms of engagingness and humanness. They also discuss the limitations of their evaluation setup, including the short length of conversations and the potential for bias in human evaluations. The authors conclude that while their models show promise, there is still much work to be done to create truly effective open-domain chatbots. They also emphasize the importance of releasing models to enable full insight into their capabilities.
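For reference, the perplexity reported in such evaluations is simply the exponential of the average per-token negative log-likelihood. A minimal sketch, assuming per-token losses measured in nats (the specific values below are made up for illustration):

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    nlls = list(token_nlls)
    return math.exp(sum(nlls) / len(nlls))

# Example: an average loss of about 2.49 nats per token corresponds to
# a perplexity of roughly 12.
print(perplexity([2.4, 2.6, 2.5, 2.47]))
```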