28 May 2024 | Katie Kang¹, Eric Wallace¹, Claire Tomlin¹, Aviral Kumar², Sergey Levine¹
Large language models (LLMs) often hallucinate when faced with unfamiliar queries, but the mechanisms behind this behavior are not fully understood. This study finds that unfamiliar examples in the models' finetuning data—those introducing concepts beyond the base model's knowledge—are crucial in shaping these errors. The research shows that LLMs' hallucinated predictions tend to mirror the responses associated with their unfamiliar finetuning examples. By modifying how these unfamiliar examples are supervised, the model's responses to unfamiliar queries can be steered, for example toward abstaining with "I don't know."
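As an illustration of this relabeling idea, the following is a minimal sketch (not the paper's implementation) of how unfamiliar finetuning examples might be detected and relabeled to an abstention: sample the base model on each finetuning question and, if none of the samples match the reference answer, replace the target with "I don't know." The sampler, the matching rule, and the thresholds are illustrative assumptions.

```python
from typing import Callable, Dict, List

IDK_RESPONSE = "I don't know."

def answers_match(prediction: str, reference: str) -> bool:
    """Crude correctness proxy: normalized substring match."""
    return reference.strip().lower() in prediction.strip().lower()

def relabel_unfamiliar(
    examples: List[Dict[str, str]],
    sample_base_model: Callable[[str, int], List[str]],  # assumed sampler, not a real API
    num_samples: int = 5,
    min_correct: int = 1,
) -> List[Dict[str, str]]:
    """Return a copy of `examples` where targets the base model cannot
    reproduce (a proxy for unfamiliarity) are relabeled to an abstention."""
    relabeled = []
    for ex in examples:
        samples = sample_base_model(ex["question"], num_samples)
        n_correct = sum(answers_match(s, ex["answer"]) for s in samples)
        target = ex["answer"] if n_correct >= min_correct else IDK_RESPONSE
        relabeled.append({"question": ex["question"], "answer": target})
    return relabeled

# Example usage with a stub sampler (replace with real base-model sampling):
if __name__ == "__main__":
    stub = lambda q, n: ["Paris"] * n  # pretend the base model always answers "Paris"
    data = [
        {"question": "Capital of France?", "answer": "Paris"},
        {"question": "Name of my neighbor's cat?", "answer": "Whiskers"},
    ]
    print(relabel_unfamiliar(data, stub))
```

Finetuning on the relabeled data then teaches the model to default to the abstention response on queries it is unfamiliar with, rather than imitating unsupported answers.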
The study validates this through controlled experiments on TriviaQA and MMLU using SFT, RL, and reward model finetuning. It also investigates RL strategies to improve factuality in long-form generations. While reward model hallucinations can undermine RL factuality finetuning, strategically controlling them can minimize these effects. The study proposes a method for learning more reliable reward models, which improve the efficacy of RL factuality finetuning in tasks like biography and book/movie plot generation.
The work makes two main contributions: (1) a conceptual model explaining factors influencing finetuned LLM predictions for unfamiliar queries, and (2) a more reliable approach to RL factuality finetuning for long-form generation. The findings suggest that by strategically manipulating unfamiliar finetuning examples, models can be guided to produce more accurate responses. The study also highlights the importance of controlling reward model hallucinations, particularly through conservative reward models that avoid overestimating rewards for unfamiliar inputs. These models significantly reduce the adverse effects of reward hallucinations, leading to more factual long-form responses. The research contributes to a better understanding of LLM hallucination mechanisms and principles for controlling them.
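To make the conservative-reward idea concrete, below is a minimal sketch of one way to avoid overestimating rewards on unfamiliar inputs: an ensemble of reward heads whose disagreement is subtracted from the mean reward, so inputs the reward model has not learned well receive a pessimistic estimate. The ensemble-disagreement penalty is an illustrative construction, not necessarily the paper's exact method; `RewardEnsemble`, `conservative_reward`, and the penalty weight are assumptions.

```python
import torch
import torch.nn as nn

class RewardEnsemble(nn.Module):
    """Small ensemble of reward heads over shared response features
    (features standing in for frozen LLM embeddings of a response)."""

    def __init__(self, feature_dim: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feature_dim, 1) for _ in range(num_heads))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim) -> rewards: (num_heads, batch)
        return torch.stack([head(features).squeeze(-1) for head in self.heads])

def conservative_reward(ensemble: RewardEnsemble,
                        features: torch.Tensor,
                        penalty_weight: float = 1.0) -> torch.Tensor:
    """Mean reward minus a disagreement penalty: high disagreement is treated
    as a sign of unfamiliarity, so the estimate is pulled downward."""
    rewards = ensemble(features)  # (num_heads, batch)
    return rewards.mean(dim=0) - penalty_weight * rewards.std(dim=0)

# Example usage with random features standing in for response embeddings:
if __name__ == "__main__":
    ens = RewardEnsemble(feature_dim=16)
    feats = torch.randn(8, 16)
    print(conservative_reward(ens, feats, penalty_weight=1.0))
```

Using such a pessimistic estimate as the RL reward limits how much the policy can exploit reward overestimates on unfamiliar responses, which is the failure mode the paper identifies for standard reward models.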