[slides] Can AI Assistants Know What They Don't Know?

Can AI assistants know what they don't know? This paper explores whether AI assistants can recognize their own knowledge boundaries and express them in natural language. The authors construct a model-specific "I don't know" (Idk) dataset, built from existing open-domain question answering datasets, that labels which questions the assistant knows and which it does not. They then align the assistant with the Idk dataset and observe whether it refuses to answer unknown questions. After alignment, the assistant refuses most of its unknown questions, and its accuracy on the questions it does attempt to answer is significantly higher than before alignment.

The paper compares several methods for teaching an assistant to recognize what it knows: prompting, supervised fine-tuning, and preference-aware optimization. All of them improve the assistant's ability to distinguish known from unknown questions and raise its truthful rate. The Ik threshold, which sets the confidence level required for a question to count as known, also shapes the assistant's behavior: a higher threshold yields more truthful responses but can make the assistant refuse more questions. The study also evaluates different models on out-of-distribution data and finds that larger models are more effective at distinguishing known from unknown questions. Overall, aligning an assistant with an Idk dataset significantly improves its truthfulness by helping it recognize its knowledge boundaries and refuse questions it cannot answer. The paper concludes that this approach is essential for developing truthful AI assistants.
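To make the role of the Ik threshold concrete, here is a minimal Python sketch of how a model-specific Idk dataset could be labeled and how a truthful rate could be computed. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how confidence is measured, so the sketch assumes it is the fraction of sampled answers that match the gold answer, and it assumes a response counts as truthful when a known question is answered correctly or an unknown question is refused. The names `label_questions`, `sample_answers`, `is_correct`, and `truthful_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IdkExample:
    question: str
    accuracy: float  # fraction of sampled answers judged correct
    known: bool      # True if accuracy reaches the Ik threshold


def label_questions(
    questions: List[str],
    gold_answers: List[str],
    sample_answers: Callable[[str], List[str]],  # hypothetical: samples answers from the assistant
    is_correct: Callable[[str, str], bool],      # hypothetical: compares an answer to the gold answer
    ik_threshold: float = 1.0,
) -> List[IdkExample]:
    """Label each question as known or unknown for one specific assistant.

    Assumption: confidence is estimated as the share of sampled answers
    that are correct, and a question is "known" when that share is at
    least `ik_threshold`.
    """
    examples = []
    for question, gold in zip(questions, gold_answers):
        samples = sample_answers(question)
        if not samples:
            raise ValueError(f"no sampled answers for question: {question!r}")
        accuracy = sum(is_correct(s, gold) for s in samples) / len(samples)
        examples.append(IdkExample(question, accuracy, accuracy >= ik_threshold))
    return examples


def truthful_rate(
    examples: List[IdkExample],
    refused: List[bool],
    correct: List[bool],
) -> float:
    """Share of questions handled truthfully after alignment.

    Assumption: a response is truthful if a known question is answered
    correctly, or an unknown question is refused.
    """
    if not examples:
        raise ValueError("no examples to score")
    truthful = 0
    for example, was_refused, was_correct in zip(examples, refused, correct):
        if example.known and not was_refused and was_correct:
            truthful += 1
        elif not example.known and was_refused:
            truthful += 1
    return truthful / len(examples)
```

Under these assumptions, raising `ik_threshold` moves borderline questions from the known set to the unknown set, which is consistent with the trade-off described above: the aligned assistant is trained to refuse more often, and the answers it still gives come from questions it answers more reliably.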

Can AI Assistants Know What They Don't Know?

28 Jan 2024 | Qinyuan Cheng, Tianxiang Sun, Xiangyang Liu, Wenwei Zhang, Zhangyue Yin, Shimin Li, Linyang Li, Zhengfu He, Kai Chen, Xipeng Qiu