3 May 2020 | Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig
This paper investigates how to more accurately estimate the knowledge contained in language models (LMs) by automatically generating better prompts to query them. Previous work relies on manually created prompts, which may be suboptimal, so the resulting estimates are only a loose lower bound on the knowledge LMs actually possess. To address this, the authors propose mining-based and paraphrasing-based methods for generating high-quality, diverse prompts, along with ensemble methods that combine the answers obtained from different prompts. Evaluated on the LAMA benchmark for extracting relational knowledge from LMs, these methods improve accuracy from 31.1% to 39.6%, providing a tighter lower bound on what LMs know. The authors also release the LM Prompt And Query Archive (LPAQA) to facilitate future experiments on probing knowledge in LMs.

The paper discusses several prompt generation techniques, including mining-based prompts derived from Wikipedia and paraphrasing-based prompts generated through back-translation. It also explores prompt selection and ensembling strategies, showing that diverse prompts improve performance: optimized ensembles outperform rank-based ensembles, and even small modifications to a prompt can lead to significant accuracy gains. The study further examines different LMs, including BERT, ERNIE, and KnowBert, and finds that optimized prompts improve accuracy across all of them. The authors conclude that their methods give a more accurate estimate of the knowledge contained in LMs, and that further research is needed to improve knowledge retrieval from LMs.
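As an illustration of the paraphrasing-based approach, the sketch below round-trips a seed prompt through a translation model to obtain paraphrased prompts. It is a minimal sketch, not the authors' implementation: the MarianMT models, the beam settings, and the seed prompt are illustrative assumptions.

```python
# A minimal sketch of back-translation-based prompt paraphrasing.
# The Helsinki-NLP MarianMT models and the seed prompt are illustrative
# choices, not the setup used in the paper.
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name, num_outputs=1):
    """Translate a list of sentences with a pretrained MarianMT model."""
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch, num_beams=5,
                               num_return_sequences=num_outputs)
    return [tok.decode(g, skip_special_tokens=True) for g in generated]

# Seed prompt for the "place of birth" relation; [X] and [Y] mark the
# subject and object slots.
seed_prompt = "[X] was born in [Y] ."

# English -> German -> English round trip; returning several beams from
# the backward pass yields multiple candidate paraphrases.
german = translate([seed_prompt], "Helsinki-NLP/opus-mt-en-de")
paraphrases = translate(german, "Helsinki-NLP/opus-mt-de-en", num_outputs=5)
print(paraphrases)
```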
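To make the querying and ensembling step concrete, the following sketch queries BERT with several paraphrases of the place-of-birth relation and averages the answer distributions over the masked object slot. This is a hedged sketch, not the released LPAQA pipeline: uniform averaging stands in for the paper's rank-based and optimized weighting, and the prompts and example subject are made up for illustration.

```python
# A minimal sketch of prompt-based knowledge probing with a masked LM and
# a uniform-weight prompt ensemble. Prompts and the example subject are
# illustrative; the paper learns per-prompt weights instead of averaging.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

subject = "Dante"  # LAMA-style query: "Dante was born in [MASK]."
prompts = [
    "[X] was born in [Y].",
    "[X] is a native of [Y].",
    "The birthplace of [X] is [Y].",
]

def answer_distribution(prompt: str) -> torch.Tensor:
    """Probability distribution the LM assigns to the masked object slot."""
    text = prompt.replace("[X]", subject).replace("[Y]", tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits[0, mask_pos].softmax(dim=-1).squeeze(0)

# Ensemble by averaging the per-prompt distributions (uniform weights).
avg_probs = torch.stack([answer_distribution(p) for p in prompts]).mean(dim=0)
prediction = tokenizer.convert_ids_to_tokens(int(avg_probs.argmax()))
print(prediction)  # ideally "Florence" for this example
```

In the paper itself, the ensemble weights are either derived from each prompt's rank on training data or directly optimized; the uniform average above is only the simplest stand-in for that step.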