Unfamiliar Finetuning Examples Control How Language Models Hallucinate


28 May 2024 | Katie Kang, Eric Wallace, Claire Tomlin, Aviral Kumar, Sergey Levine
This paper investigates how large language models (LLMs) hallucinate when faced with unfamiliar queries. The authors find that LLMs tend to mimic the responses associated with unfamiliar examples in their finetuning data, i.e., examples that introduce concepts beyond the model's pretrained knowledge. They validate this hypothesis with controlled experiments using SFT, RL, and reward model finetuning on datasets such as TriviaQA and MMLU. The results show that by modifying how unfamiliar finetuning examples are supervised, the model's predictions for unfamiliar test queries can be steered toward more factually accurate responses. The paper also examines how reward model hallucinations affect RL factuality finetuning, and proposes learning more conservative reward models that avoid overestimating rewards on unfamiliar inputs. This approach substantially improves the efficacy of RL factuality finetuning on long-form generation tasks such as biography and book/movie plot generation. The authors conclude by highlighting the importance of understanding and controlling hallucinations in order to build more trustworthy and reliable LLMs.
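To make the core idea concrete, here is a minimal sketch (not the authors' code) of how supervision for unfamiliar finetuning examples might be modified: examples the base model is unlikely to know are relabeled with a hedged target, so finetuning teaches the model to abstain rather than fabricate. The `familiarity` scorer, the threshold, and the abstention string below are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch: relabel unfamiliar finetuning examples so the model
# learns to hedge on queries it does not know, instead of hallucinating.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    query: str
    target: str


def relabel_unfamiliar(
    dataset: List[Example],
    familiarity: Callable[[str], float],  # hypothetical scorer, e.g. the base
                                          # model's estimated probability of
                                          # answering the query correctly
    threshold: float = 0.3,               # assumed cutoff for "unfamiliar"
    abstain: str = "I don't know.",
) -> List[Example]:
    """Replace targets of examples the base model is unfamiliar with by an
    abstention string, leaving familiar examples unchanged."""
    relabeled = []
    for ex in dataset:
        if familiarity(ex.query) < threshold:
            relabeled.append(Example(ex.query, abstain))
        else:
            relabeled.append(ex)
    return relabeled
```

The same principle carries over to the reward-model setting described above: instead of an abstention target, unfamiliar inputs would receive a pessimistic (low) reward so the reward model does not overestimate the quality of responses it cannot actually evaluate.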