This paper proposes a novel method called Mutation-based Consistency Testing (MCT) to systematically evaluate the code understanding capability of Large Language Models (LLMs), particularly focusing on subtle inconsistencies between code and its natural language description. The method introduces code mutations to existing code generation datasets to create mismatches between code and its description. Different types of code mutations, such as operator replacement and statement deletion, are applied to generate inconsistent code-description pairs. These pairs are then used to test the ability of LLMs to detect inconsistencies.
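To make the idea concrete, the sketch below illustrates how an operator-replacement mutation could turn a consistent code-description pair into an inconsistent one; the function, the description, and the specific mutation are illustrative assumptions rather than the paper's actual implementation or dataset entries.

```python
# Illustrative sketch (not the paper's implementation): an operator-replacement
# mutation applied to a HumanEval-style problem, producing code that no longer
# matches its natural language description.

description = "Return True if the number n is strictly greater than the threshold t."

original_code = (
    "def above_threshold(n, t):\n"
    "    return n > t\n"
)

# Operator replacement: '>' becomes '>=', so the code now also returns True
# when n equals t, contradicting the word "strictly" in the description.
mutated_code = original_code.replace(" > ", " >= ")

# The (description, mutated_code) pair is what an LLM would then be asked to
# judge for consistency.
print(description)
print(mutated_code)
```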
The MCT method is applied to two popular LLMs, GPT-3.5 and GPT-4, on the HumanEval-X benchmark, which covers six programming languages. The results show that GPT-4 significantly outperforms GPT-3.5 in terms of MCT scores, although GPT-4 also shows weaknesses in handling relational logic and Java programs. GPT-3.5's performance can be greatly improved with one-shot prompts. The study also investigates how different mutation operators and programming languages affect the performance of the LLMs. Overall, the results indicate that MCT provides valuable insights into the strengths and weaknesses of LLMs in understanding code semantics.
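As an illustration of the one-shot prompting mentioned above, the sketch below shows one plausible way to assemble such a prompt for the consistency-checking task; the prompt wording, the exemplar pair, and the helper function are assumptions for illustration and do not reproduce the paper's exact prompts.

```python
# Hypothetical one-shot prompt for the consistency-checking task; the wording
# and the exemplar pair are assumptions for illustration, not the paper's prompt.

ONE_SHOT_EXAMPLE = (
    "Description: Return the sum of two integers.\n"
    "Code:\n"
    "def add(a, b):\n"
    "    return a - b\n"
    "Question: Is the code consistent with the description? Answer Yes or No.\n"
    "Answer: No\n"
)

def build_prompt(description: str, code: str) -> str:
    """Prepend the solved exemplar, then append the new pair to be judged."""
    query = (
        f"Description: {description}\n"
        f"Code:\n{code}\n"
        "Question: Is the code consistent with the description? Answer Yes or No.\n"
        "Answer:"
    )
    return ONE_SHOT_EXAMPLE + "\n" + query

# Example usage: a pair where the code checks oddness instead of evenness.
print(build_prompt(
    "Return True if n is even.",
    "def is_even(n):\n    return n % 2 == 1",
))
```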
The paper also presents a case study that demonstrates the applicability of MCT to GPT-3.5 and GPT-4. The results show that MCT can effectively identify the conditions under which LLMs produce correct or incorrect answers. The study further explores the impact of prompt engineering, showing that one-shot prompts significantly improve the performance of GPT-3.5. The findings highlight the importance of prompt engineering in enhancing the accuracy and adaptability of LLMs on complex tasks. The paper concludes that MCT provides a systematic way to evaluate the code understanding capability of LLMs and offers valuable implications for future research and development of LLM-based software engineering.