Fine-Tuning and Prompt Engineering for Large Language Models-based Code Review Automation

June 18, 2024 | Chanathip Pornprasit, Chakkrit Tantithamthavorn
This paper investigates the performance of large language models (LLMs) for code review automation, focusing on two approaches: fine-tuning and prompt engineering. The study compares GPT-3.5 and Magicoder against existing code review automation methods, including Guo et al.'s approach and three others (CodeReviewer, TufanoT5, and D-ACT). The results show that fine-tuning GPT-3.5 with zero-shot learning achieves 73.17%–74.23% higher Exact Match (EM) than Guo et al.'s approach. Additionally, GPT-3.5 with few-shot learning achieves 46.38%–659.09% higher EM than GPT-3.5 with zero-shot learning. The study also finds that few-shot learning without a persona is more effective than zero-shot learning with a persona. The findings suggest that LLMs for code review automation should be fine-tuned to achieve the highest performance, and that when fine-tuning data is insufficient, few-shot learning without a persona is recommended. The study contributes practical recommendations and insights into the trade-offs of deploying LLMs in code review automation.

Keywords: Modern Code Review, Code Review Automation, Large Language Models, GPT-3.5, Few-Shot Learning, Persona.
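To make the prompting setup concrete, the following is a minimal sketch of few-shot prompting without a persona for code refinement. It is not the authors' exact prompt; it assumes the OpenAI Python client, and the demonstration pair and model name are hypothetical placeholders rather than the paper's data.

```python
# Sketch of few-shot prompting WITHOUT a persona (no "You are an expert reviewer"
# system message). Assumes the `openai` package is installed and OPENAI_API_KEY is set.
# The demonstration pair below is a hypothetical placeholder, not from the paper.
from openai import OpenAI

client = OpenAI()

# Hypothetical demonstration pair: submitted code + reviewer comment -> revised code.
FEW_SHOT_EXAMPLES = [
    {
        "submitted": "def add(a, b):\n    return a+b",
        "comment": "Please add type hints.",
        "revised": "def add(a: int, b: int) -> int:\n    return a + b",
    },
]

def build_messages(submitted_code: str, reviewer_comment: str) -> list[dict]:
    """Build a few-shot prompt with no persona: demonstrations first, then the query."""
    messages = []
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({
            "role": "user",
            "content": (
                "Improve the submitted code according to the reviewer comment.\n"
                f"Submitted code:\n{ex['submitted']}\n"
                f"Reviewer comment: {ex['comment']}"
            ),
        })
        messages.append({"role": "assistant", "content": ex["revised"]})
    messages.append({
        "role": "user",
        "content": (
            "Improve the submitted code according to the reviewer comment.\n"
            f"Submitted code:\n{submitted_code}\n"
            f"Reviewer comment: {reviewer_comment}"
        ),
    })
    return messages

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=build_messages("def sub(a, b): return a-b", "Please add type hints."),
    temperature=0.0,  # deterministic decoding is common when evaluating Exact Match
)
print(response.choices[0].message.content)
```

A zero-shot variant would simply omit the demonstration messages, and a persona variant would prepend a system message such as a reviewer role description; the study's finding is that the demonstrations matter more than the persona.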