Understanding Towards a Human-like Open-Domain Chatbot

This paper presents Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. Meena is a 2.6B-parameter neural network trained to minimize the perplexity of the next token. The paper also proposes a human evaluation metric, Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Experiments show a strong correlation between perplexity and SSA. The best end-to-end trained Meena scores 72% SSA on multi-turn evaluation, which suggests that the human-level SSA of 86% is potentially within reach if perplexity can be optimized further. The full version of Meena, with a filtering mechanism and tuned decoding, scores 79% SSA, 23% higher in absolute SSA than the existing chatbots the paper evaluates.
The paper evaluates chatbots with two types of human evaluation: static and interactive. Static evaluation benchmarks models on a fixed set of multi-turn contexts, while interactive evaluation lets humans chat freely with the chatbots. SSA combines two fundamental aspects of a human-like chatbot: making sense and being specific. Human judges label each model response on these two criteria. SSA serves as a proxy for human likeness and penalizes chatbots that consistently produce generic responses.
The paper also discusses weaknesses of its methodology, such as the static evaluation dataset being too restricted to capture all aspects of human conversation. Even so, Meena's high SSA score and the correlation between SSA and perplexity suggest that a chatbot that is human-like in sensibleness and specificity could be in sight if better perplexity can be attained.
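
To make the SSA metric concrete, here is a minimal sketch of how it could be computed from human labels. It assumes each response carries two boolean judgments (sensible, specific) and that SSA is the plain average of the two per-label fractions. It also assumes a response judged not sensible is counted as not specific, which this summary does not state. The function name `ssa` and the sample labels are hypothetical.

```python
from typing import List, Tuple


def ssa(labels: List[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Return (sensibleness, specificity, SSA) as fractions in [0, 1].

    Each label is a (sensible, specific) pair for one model response.
    """
    if not labels:
        raise ValueError("need at least one labeled response")
    n = len(labels)
    sensible = sum(1 for s, _ in labels if s)
    # Assumption: a response only counts as specific if it is also sensible.
    specific = sum(1 for s, p in labels if s and p)
    sensibleness = sensible / n
    specificity = specific / n
    # SSA is the simple average of the two fractions.
    return sensibleness, specificity, (sensibleness + specificity) / 2


# Hypothetical labels for four responses: (sensible, specific).
example = [(True, True), (True, False), (False, False), (True, True)]
print(ssa(example))  # (0.75, 0.5, 0.625)

# A bot that always gives a sensible but generic reply is capped at 0.5 SSA.
generic = [(True, False)] * 4
print(ssa(generic))  # (1.0, 0.0, 0.5)
```

The second call illustrates why the metric penalizes generic responses: perfect sensibleness alone cannot lift the average above 50%.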
The paper's contributions include: (1) proposing a simple human evaluation metric for multi-turn open-domain chatbots that captures basic, but important, attributes of human conversation; (2) showing evidence that perplexity is an automatic metric that correlates with human judgment, in contrast to recent findings on other automatic metrics; (3) demonstrating that an end-to-end neural model with sufficiently low perplexity can surpass the sensibleness and specificity of existing chatbots that rely on complex, handcrafted frameworks.
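
Perplexity, the quantity Meena is trained to minimize, is conventionally the exponential of the average negative log-likelihood per token. The sketch below uses that standard definition and assumes natural-log token probabilities as input. The function name and the sample values are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Computes exp(-(1/N) * sum(log p_i)) over the N tokens.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: a model that assigns probability 1/4 to every token
# is as uncertain as a uniform choice among 4 options, so perplexity is ~4.
uniform_quarter = [math.log(0.25)] * 5
print(perplexity(uniform_quarter))
```

Lower perplexity means the model assigns higher probability to the observed next tokens, which is the sense in which the paper relates it to SSA.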

Towards a Human-like Open-Domain Chatbot

27 Feb 2020 | Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le