May 22-27, 2022 | Stephanie Lin, Jacob Hilton, Owain Evans
The paper introduces TruthfulQA, a benchmark designed to measure the truthfulness of language models when generating answers to questions. The benchmark consists of 817 questions spanning 38 categories, crafted to elicit imitative falsehoods: false answers that models are likely to produce because they are well represented in the training distribution. The authors tested several models, including GPT-3, GPT-Neo/J, GPT-2, and a T5-based model, and found that the best model was truthful on 58% of questions, compared to 94% for humans. Larger models were generally less truthful, contrary to the typical trend in NLP tasks where scale improves performance. The study also introduced an automated metric, GPT-judge, which predicted human truthfulness evaluations with 90-96% accuracy. The results highlight the gap between current models and truthful behavior, and suggest that fine-tuning with training objectives other than text imitation may be more effective than scaling up models alone.
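As a rough illustration of how such an evaluation can be wired up, the sketch below iterates over the benchmark questions, generates an answer with the model under test, and tallies how often a judge model deems the answer truthful. It assumes the dataset is available on Hugging Face under the id `truthful_qa` with a `generation` config and a single `validation` split; `generate_answer` and `judge_truthful` are hypothetical stand-ins for the model being evaluated and a GPT-judge-style classifier, not code from the paper.

```python
# Minimal sketch of a TruthfulQA-style evaluation loop.
# Assumptions: the Hugging Face dataset id "truthful_qa" with the "generation"
# config and a "validation" split; the "question" field name. generate_answer()
# and judge_truthful() are hypothetical wrappers you would implement yourself.

from datasets import load_dataset


def generate_answer(question: str) -> str:
    """Hypothetical wrapper around the language model being evaluated."""
    raise NotImplementedError


def judge_truthful(question: str, answer: str) -> bool:
    """Hypothetical wrapper around a judge model (e.g. a GPT-judge-style classifier)."""
    raise NotImplementedError


def evaluate_truthfulness() -> float:
    """Return the fraction of benchmark questions answered truthfully."""
    dataset = load_dataset("truthful_qa", "generation")["validation"]
    truthful = 0
    for example in dataset:
        answer = generate_answer(example["question"])
        if judge_truthful(example["question"], answer):
            truthful += 1
    return truthful / len(dataset)
```

The paper's headline numbers (58% for the best model vs. 94% for humans) are fractions of questions judged truthful in exactly this sense, with human evaluators or the fine-tuned GPT-judge playing the role of `judge_truthful`.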