27 Feb 2020 | Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le
The paper presents Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public-domain social media conversations. The 2.6B-parameter neural network is trained to minimize the perplexity of the next token. The authors also propose a human evaluation metric, Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Experiments show a strong correlation between perplexity and SSA. The best end-to-end trained Meena scores 72% SSA, which suggests that the human-level SSA of 86% is potentially within reach with better optimization of perplexity. The full version of Meena, with a filtering mechanism and tuned decoding, scores 79% SSA, 23% higher in absolute SSA than the existing chatbots evaluated.
The paper also discusses the limitations of the methodology and contributes to the field by proposing a simple human evaluation metric, demonstrating the correlation between perplexity and human judgment, and showing that an end-to-end neural model with low perplexity can surpass existing chatbots in terms of sensibleness and specificity.
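To make the two quantities in this summary concrete, here is a minimal Python sketch of how SSA and perplexity could be computed. It assumes SSA is the plain average of two per-response rates (the fraction of responses labeled sensible and the fraction labeled specific), that a response not labeled sensible is counted as not specific, and that perplexity is the exponential of the mean negative log-likelihood per token. The function names `ssa` and `perplexity` and the input formats are hypothetical, chosen for illustration only, and are not taken from the paper's code.

```python
import math
from typing import List, Tuple


def ssa(labels: List[Tuple[bool, bool]]) -> float:
    """Average of sensibleness and specificity over labeled responses.

    Each item is a hypothetical (sensible, specific) pair of human labels
    for one chatbot response.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    sensibleness = sum(1 for sensible, _ in labels if sensible) / n
    # Assumption: a response that is not sensible is counted as not specific.
    specificity = sum(1 for sensible, specific in labels if sensible and specific) / n
    return (sensibleness + specificity) / 2


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Illustrative labels only, not data from the paper.
example_labels = [(True, True), (True, False), (False, False), (True, True)]
example_ssa = ssa(example_labels)

# A model that assigns probability 0.25 to every token.
example_ppl = perplexity([math.log(0.25)] * 4)
```

With these illustrative labels, sensibleness is 3/4 and specificity is 2/4, so the sketch yields an SSA of 0.625. A model that assigns probability 0.25 to every token has a perplexity of about 4, which is why lower perplexity corresponds to a model that is less "surprised" by the next token.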