April 25, 2024 | Matthew Dahl, Varun Magesh, Mirac Suzgun, Daniel E. Ho
Large language models (LLMs) often produce false legal information, a phenomenon known as "legal hallucinations." This study examines the extent and nature of these hallucinations in public-facing LLMs, including OpenAI’s ChatGPT 4, Google’s PaLM 2, and Meta’s Llama 2. When asked direct, verifiable questions about federal court cases, the models hallucinate at least 58% of the time, with Llama 2 hallucinating up to 88% of the time. Hallucinations are pervasive across all three models, though ChatGPT 4 performs best overall.
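As a rough illustration of this kind of evaluation, the sketch below poses direct, verifiable questions about real cases and counts any answer that contradicts the authoritative record as a hallucination. The query_llm callable and the tiny reference table are placeholders for exposition only; neither comes from the study.

```python
# Minimal sketch (not the authors' code) of a hallucination-rate measurement:
# ask direct, verifiable questions about federal court cases and compare each
# model answer against a trusted reference record.
from typing import Callable

# Toy reference data for illustration: case name -> author of the majority opinion.
REFERENCE = {
    "Brown v. Board of Education": "Warren",
    "Miranda v. Arizona": "Warren",
    "Chevron v. NRDC": "Stevens",
}

def hallucination_rate(query_llm: Callable[[str], str]) -> float:
    """Fraction of verifiable questions the model answers incorrectly."""
    errors = 0
    for case, true_author in REFERENCE.items():
        prompt = (f"Who wrote the majority opinion in {case}? "
                  "Answer with the justice's last name only.")
        answer = query_llm(prompt)
        # Under this simple binary scoring, any answer that does not match
        # the authoritative record counts as a hallucination.
        if true_author.lower() not in answer.lower():
            errors += 1
    return errors / len(REFERENCE)

# Example with a stub model that always answers "Holmes":
print(hallucination_rate(lambda prompt: "Holmes"))  # -> 1.0
```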
The study develops a typology of legal hallucinations, distinguishing between closed-domain (intrinsic) hallucinations, open-domain (extrinsic) hallucinations, and factual infidelity. Closed-domain hallucinations occur when an LLM's response is unfaithful to, or conflicts with, the input prompt. Open-domain hallucinations occur when the response contradicts, or cannot be traced back to, the training corpus. Factual infidelity occurs when the response misstates the facts of the world.
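To make these distinctions concrete, one illustrative framing (not drawn from the paper) expresses the typology as a simple labeling rule, where each boolean stands in for a fidelity judgment that in practice requires careful legal fact-checking.

```python
# Illustrative sketch of the three-way typology as a labeling routine.
from enum import Enum, auto

class Hallucination(Enum):
    CLOSED_DOMAIN = auto()       # response conflicts with the input prompt
    OPEN_DOMAIN = auto()         # response is not supported by the training corpus
    FACTUAL_INFIDELITY = auto()  # response conflicts with the facts of the world
    NONE = auto()

def classify(faithful_to_prompt: bool,
             grounded_in_corpus: bool,
             consistent_with_world: bool) -> Hallucination:
    if not faithful_to_prompt:
        return Hallucination.CLOSED_DOMAIN
    if not grounded_in_corpus:
        return Hallucination.OPEN_DOMAIN
    if not consistent_with_world:
        return Hallucination.FACTUAL_INFIDELITY
    return Hallucination.NONE

# A response that tracks the prompt and the training data but misstates a
# real holding would be labeled FACTUAL_INFIDELITY:
print(classify(True, True, False))
```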
The study also finds that LLMs are susceptible to contra-factual bias: when a query is premised on a mistaken legal assumption, they tend to accept the false premise and answer as though it were true. Additionally, LLMs often fail to gauge their own level of certainty accurately, so they cannot reliably signal when an answer may be hallucinated. The research highlights the need for caution in the rapid and unsupervised integration of LLMs into legal tasks, as these shortcomings significantly limit their effectiveness in legal settings.
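A hedged sketch of how these two failure modes might be probed appears below: a question built on a false premise (Justice Ginsburg in fact joined the Obergefell majority and did not dissent) tests for contra-factual bias, and comparing a self-reported confidence score against actual correctness gives a crude overconfidence measure. The query_llm callable and the phrase-matching heuristic are assumptions for illustration, not the study's methodology.

```python
# Sketch of a contra-factual bias probe and a simple calibration check.
from typing import Callable

# The premise is false: Justice Ginsburg joined the Obergefell majority.
FALSE_PREMISE_PROMPT = "Why did Justice Ginsburg dissent in Obergefell v. Hodges?"

# Crude heuristic: phrases that indicate the model rejected the false premise.
REJECTION_MARKERS = ("did not dissent", "was in the majority", "joined the majority")

def accepts_false_premise(query_llm: Callable[[str], str]) -> bool:
    """True if the model answers as though the false premise were correct."""
    answer = query_llm(FALSE_PREMISE_PROMPT).lower()
    return not any(marker in answer for marker in REJECTION_MARKERS)

def overconfidence(query_llm: Callable[[str], str],
                   question: str, answered_correctly: bool) -> float:
    """Gap between self-reported confidence (0-1) and actual correctness;
    positive values indicate overconfidence. Assumes the model replies with
    a bare number when asked for its confidence."""
    stated = float(query_llm(
        f"{question}\nOn a scale from 0 to 1, how confident are you in your "
        "answer? Reply with a number only."
    ))
    return stated - (1.0 if answered_correctly else 0.0)
```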
The study's findings suggest that LLMs are not reliable sources of legal knowledge, and their use in legal contexts requires careful consideration. The results also indicate that LLMs may produce a falsely homogeneous sense of the legal landscape, collapsing important legal nuances and perpetuating representational harms. The study concludes that while LLMs have the potential to make legal information and services more accessible and affordable, their current limitations in generating accurate and reliable legal statements significantly hinder this objective.