Fine-Tuning and Prompt Engineering for Large Language Models-based Code Review Automation

June 18, 2024 | Chanathip Pornprasit, Chakkrit Tantithamthavorn
The paper investigates the performance of Large Language Models (LLMs) for code review automation, focusing on two approaches: fine-tuning and prompt engineering. The authors aim to determine the most effective methods for leveraging LLMs in this context. They evaluate two LLMs, GPT-3.5 and Magicoder, using two evaluation metrics, Exact Match (EM) and CodeBLEU, on three code review datasets: CodeReviewer data, Tufano data, and D-ACT data. The study finds that fine-tuning GPT-3.5 significantly improves performance, achieving 73.17% to 74.23% higher EM than the baseline approach when used with zero-shot learning. Additionally, when GPT-3.5 is not fine-tuned, few-shot learning improves its performance, achieving 63.91% to 1,100% higher EM than zero-shot learning. The results also suggest that few-shot learning without a persona is the most effective prompting strategy when data is insufficient for fine-tuning. The paper concludes with recommendations for practitioners, emphasizing the importance of fine-tuning LLMs for optimal performance and the use of few-shot learning without a persona when data is limited.
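
To make the compared prompting strategies concrete, the sketch below shows how zero-shot and few-shot prompts, with and without a persona, might be assembled for code review automation. The message layout, wording, and persona text are illustrative assumptions, not the authors' exact templates.

```python
# A minimal sketch of the prompt variants compared in the paper (zero-shot vs.
# few-shot, with vs. without a persona). Wording and structure are assumptions.

PERSONA = "You are an expert software developer reviewing code changes."

def build_messages(submitted_code, few_shot_examples=None, use_persona=False):
    """Assemble chat-style messages asking an LLM to produce the improved code."""
    messages = []
    if use_persona:
        messages.append({"role": "system", "content": PERSONA})
    # Few-shot: prepend (submitted code, revised code) demonstration pairs.
    for example in few_shot_examples or []:
        messages.append({"role": "user",
                         "content": "Improve the following code:\n" + example["before"]})
        messages.append({"role": "assistant", "content": example["after"]})
    # The code under review comes last; zero-shot uses only this message.
    messages.append({"role": "user",
                     "content": "Improve the following code:\n" + submitted_code})
    return messages

# Example: a few-shot prompt without a persona, the strategy the paper
# recommends when data is insufficient for fine-tuning.
demos = [{"before": "int add(int a,int b){return a+b;}",
          "after": "int add(int a, int b) { return a + b; }"}]
prompt = build_messages("int sub(int a,int b){return a-b;}", few_shot_examples=demos)
```

The Exact Match metric reported in the evaluation can be approximated as below; the whitespace normalization step is an assumption about how generated and reference revisions are compared.

```python
def exact_match(generated, reference):
    """Return 1 if the generated revision equals the reference after simple
    whitespace normalization (the normalization choice is an assumption)."""
    normalize = lambda s: " ".join(s.split())
    return int(normalize(generated) == normalize(reference))
```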