11 May 2024 | Fang Liu*, Yang Liu*, Lin Shi†, Houkun Huang*, Ruifeng Wang*, Zhen Yang‡, Li Zhang*, Zhongqi Li§, Yuchi Ma§
This paper explores and evaluates hallucinations in code generated by Large Language Models (LLMs). Despite significant advances in LLM-based code generation, these models are prone to producing outputs that deviate from the user's intent, exhibit internal inconsistencies, or conflict with factual knowledge. Through a thematic analysis of LLM-generated code, the study summarizes and categorizes these hallucinations into a comprehensive taxonomy of 5 primary categories covering 19 specific types: Intent Conflicting, three forms of Context Deviation (inconsistency, repetition, and dead code), and Knowledge Conflicting. The analysis also shows that multiple hallucinations can co-occur within a single program and that different LLMs exhibit distinct hallucination patterns. Building on the taxonomy, the authors develop HALLUCODE, a benchmark for evaluating LLMs' ability to recognize and mitigate hallucinations. Experiments with HALLUCODE and HumanEval show that existing LLMs struggle to recognize hallucinations, particularly to identify their specific types, and have difficulty mitigating them. The findings highlight the need for further research on hallucination evaluation, detection, and mitigation to improve the reliability and effectiveness of code LLMs.
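To make the taxonomy concrete, the sketch below is a minimal, hypothetical Python illustration (not drawn from the paper or from HALLUCODE) of two of its hallucination types: dead code, where generated statements can never affect the result, and Knowledge Conflicting, where generated code misuses a real API.

```python
# Hypothetical illustration (not from the paper or HALLUCODE) of two hallucination
# types named in the taxonomy: dead code (a form of Context Deviation) and
# Knowledge Conflicting (misuse of an existing API).

import json


def sum_of_squares(nums):
    """Intended behavior: return the sum of squares of the numbers in `nums`."""
    total = 0
    for n in nums:
        total += n * n
    # Dead code: a sum of squares of real numbers is never negative,
    # so this branch can never execute and contributes nothing.
    if total < 0:
        total = 0
    return total


def load_config(path):
    """Knowledge Conflicting: `json.loads` expects a JSON string, not a file path,
    so this call conflicts with the standard library's actual API."""
    return json.loads(path)  # would need to read the file first, e.g. with json.load


if __name__ == "__main__":
    print(sum_of_squares([1, 2, 3]))  # 14; the dead branch never runs
```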