May 22-27, 2022 | Stephanie Lin, Jacob Hilton, Owain Evans
TruthfulQA is a benchmark designed to measure how truthful language models are in generating answers to questions. The benchmark includes 817 questions across 38 categories, including health, law, finance, and politics. These questions are crafted to elicit imitative falsehoods, where models generate false answers that mimic common human misconceptions. The benchmark tests models such as GPT-3, GPT-Neo/J, GPT-2, and T5-based models. The best-performing model, GPT-3-175B with a "helpful" prompt, achieved 58% truthfulness, while humans scored 94%.

Models often generate false answers that mimic popular misconceptions, and larger models tend to be less truthful, contradicting the scaling trend seen in other NLP tasks. This suggests that scaling up models alone is less effective for improving truthfulness than fine-tuning with training objectives other than imitation of text. TruthfulQA highlights the importance of measuring truthfulness in models, as false answers can lead to deception and distrust. The benchmark also shows that automated metrics can predict human evaluations with high accuracy, providing a useful tool for assessing model performance. Overall, TruthfulQA emphasizes the need for models that are both truthful and informative, especially in critical applications like medicine, law, and science.