Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

11 May 2024 | Fang Liu*, Yang Liu†, Lin Shi†, Houkun Huang*, Ruifeng Wang*, Zhen Yang†, Li Zhang*, Zhongqi Li‡, Yuchi Ma‡
This paper investigates hallucinations in code generation by large language models (LLMs). We conducted a thematic analysis of code generated by various LLMs to categorize and understand the types of hallucinations present. Our study established a comprehensive taxonomy of hallucinations in code generation, identifying five primary categories: Intent Conflicting, Context Inconsistency, Context Repetition, Dead Code, and Knowledge Conflicting. These categories encompass 19 specific types of hallucinations. We also analyzed the distribution of hallucinations across different LLMs and their correlation with code correctness. Based on these findings, we proposed HALLUCODE, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations. Experiments with HALLUCODE and HumanEval show that existing LLMs face significant challenges in recognizing and mitigating hallucinations, particularly in identifying their types. Our findings highlight the importance of evaluating hallucination recognition and mitigation in code LLMs to build more effective and reliable models. The study also emphasizes the need for further research into the characteristics and patterns of hallucinations in different code generation tasks.
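
To make the taxonomy more concrete, the sketch below gives hypothetical illustrations (constructed for this summary, not taken from the paper's data) of two of the five categories: an Intent Conflicting hallucination, where the generated logic contradicts the stated requirement, and a Dead Code hallucination, where the generated code contains statements that can never affect the result.

```python
# Hypothetical illustrations of two hallucination categories from the taxonomy.
# These snippets are constructed examples, not drawn from the study itself.

# Intent Conflicting: the docstring asks for the largest value, but the body
# returns the smallest, so the generated logic contradicts the stated intent.
def find_largest(numbers):
    """Return the largest number in the list."""
    return min(numbers)  # conflicts with the intent expressed in the docstring


# Dead Code: the function computes a value that is never used and contains a
# branch that can never execute, so parts of the generated code are inert.
def sum_positive(numbers):
    """Return the sum of the positive numbers in the list."""
    total = 0
    unused_squares = [n * n for n in numbers]  # computed but never used
    for n in numbers:
        if n > 0:
            total += n
        elif n > 100:  # unreachable: any n > 100 already satisfies n > 0
            total -= n
    return total
```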