30 May 2024 | Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, Daniel E. Ho
This paper evaluates the reliability of leading AI legal research tools, focusing on their ability to avoid "hallucinations," that is, false information generated by AI systems. The study assesses three RAG-based tools, LexisNexis's Lexis+ AI, Thomson Reuters's Ask Practical Law AI, and Westlaw's AI-Assisted Research, alongside GPT-4 as a general-purpose baseline. The findings show that while these tools hallucinate less than general-purpose chatbots like GPT-4, they still produce false information at significant rates: Lexis+ AI hallucinates between 17% and 33% of the time, and Westlaw AI-Assisted Research and Ask Practical Law AI hallucinate even more often. Performance also varies substantially across systems, with Lexis+ AI the most accurate, followed by Westlaw and then Ask Practical Law AI.

The paper contributes a comprehensive dataset for evaluating legal AI tools and proposes a typology for distinguishing hallucinated from accurate legal responses. It stresses that legal professionals must supervise and verify AI outputs, since the responsible integration of AI into law remains a critical challenge. More broadly, the study underscores the limitations of current AI technology and the importance of understanding why these systems fail, offering empirical evidence to inform the ethical and practical questions raised by the use of AI in legal practice.
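To make the retrieval-augmented generation (RAG) architecture behind these tools concrete, here is a minimal sketch of the retrieve-then-ground-then-generate pattern. Everything in it is an illustrative assumption rather than vendor code: the toy corpus, the word-overlap scoring (production systems use dense or sparse retrievers over full legal databases), and the hypothetical `llm_generate` placeholder standing in for a call to a model such as GPT-4.

```python
# Minimal RAG sketch: retrieve sources, then ask the model to answer
# from those sources with citations. Grounding the answer in retrieved
# documents is the mechanism that is supposed to reduce hallucination
# relative to a bare chatbot.

from typing import List, Tuple

# Toy document store standing in for a legal database (cases, statutes).
CORPUS = [
    ("Smith v. Jones (2010)", "Holding: a contract requires mutual assent."),
    ("28 U.S.C. § 1331", "District courts have federal question jurisdiction."),
    ("Doe v. Roe (2015)", "Holding: negligence requires duty, breach, causation, damages."),
]

def retrieve(query: str, k: int = 2) -> List[Tuple[str, str]]:
    """Rank documents by naive word overlap with the query (a stand-in
    for the retrievers real systems use)."""
    q_words = set(query.lower().split())
    scored = [
        (len(q_words & set(text.lower().split())), (title, text))
        for title, text in CORPUS
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(query: str, docs: List[Tuple[str, str]]) -> str:
    """Prepend the retrieved sources and instruct the model to answer
    only from them, citing what it relies on."""
    context = "\n".join(f"[{title}] {text}" for title, text in docs)
    return (
        f"Sources:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the sources above, with citations."
    )

def llm_generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM API call (e.g., to GPT-4)."""
    return f"(model output for a {len(prompt)}-character prompt)"

def answer(query: str) -> str:
    return llm_generate(build_prompt(query, retrieve(query)))

if __name__ == "__main__":
    print(answer("What are the elements of negligence?"))
```

As the paper's results suggest, this grounding step narrows but does not close the gap: the model can still misread a retrieved source or cite one that does not support its claim, which is why the authors emphasize human verification of outputs.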