This paper proposes a novel method called Mutation-based Consistency Testing (MCT) to systematically evaluate the code understanding capability of Large Language Models (LLMs), particularly focusing on subtle inconsistencies between code and its natural language description. The method introduces code mutations to existing code generation datasets to create mismatches between code and its description. Different types of code mutations, such as operator replacement and statement deletion, are applied to generate inconsistent code-description pairs. These pairs are then used to test the ability of LLMs to detect inconsistencies.
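To make the idea concrete, the sketch below illustrates how an operator-replacement mutation could turn a consistent code-description pair into an inconsistent one; the function, the description, and the specific mutation are illustrative assumptions rather than the paper's actual implementation or dataset entries.

```python
# Illustrative sketch (not the paper's implementation): an operator-replacement
# mutation applied to a HumanEval-style problem, producing code that no longer
# matches its natural language description.

description = "Return True if the number n is strictly greater than the threshold t."

original_code = (
    "def above_threshold(n, t):\n"
    "    return n > t\n"
)

# Operator replacement: '>' becomes '>=', so the code now also returns True
# when n equals t, contradicting the word "strictly" in the description.
mutated_code = original_code.replace(" > ", " >= ")

# The (description, mutated_code) pair is what an LLM would then be asked to
# judge for consistency.
print(description)
print(mutated_code)
```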
The MCT method is applied to two popular LLMs, GPT-3.5 and GPT-4, on the HumanEval-X benchmark, which covers six programming languages. The results show that GPT-4 significantly outperforms GPT-3.5 in terms of MCT scores, although GPT-4 also shows weaknesses in handling relational logic and Java programs. GPT-3.5's performance can be greatly improved with one-shot prompts. The study also investigates how different mutation operators and programming languages affect the performance of the LLMs. Overall, the results indicate that MCT provides valuable insights into the strengths and weaknesses of LLMs in understanding code semantics.
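As an illustration of the one-shot prompting mentioned above, the sketch below shows one plausible way to assemble such a prompt for the consistency-checking task; the prompt wording, the exemplar pair, and the helper function are assumptions for illustration and do not reproduce the paper's exact prompts.

```python
# Hypothetical one-shot prompt for the consistency-checking task; the wording
# and the exemplar pair are assumptions for illustration, not the paper's prompt.

ONE_SHOT_EXAMPLE = (
    "Description: Return the sum of two integers.\n"
    "Code:\n"
    "def add(a, b):\n"
    "    return a - b\n"
    "Question: Is the code consistent with the description? Answer Yes or No.\n"
    "Answer: No\n"
)

def build_prompt(description: str, code: str) -> str:
    """Prepend the solved exemplar, then append the new pair to be judged."""
    query = (
        f"Description: {description}\n"
        f"Code:\n{code}\n"
        "Question: Is the code consistent with the description? Answer Yes or No.\n"
        "Answer:"
    )
    return ONE_SHOT_EXAMPLE + "\n" + query

# Example usage: a pair where the code checks oddness instead of evenness.
print(build_prompt(
    "Return True if n is even.",
    "def is_even(n):\n    return n % 2 == 1",
))
```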
The paper also presents a case study that demonstrates the applicability of MCT to GPT-3.5 and GPT-4. The results show that MCT can effectively identify the conditions under which LLMs produce correct or incorrect answers. The study further explores the impact of prompt engineering, showing that one-shot prompts significantly improve the performance of GPT-3.5. The findings highlight the importance of prompt engineering in enhancing the accuracy and adaptability of LLMs on complex tasks. The paper concludes that MCT provides a systematic way to evaluate the code understanding capability of LLMs and offers valuable implications for future research and development of LLM-based software engineering.