Source Code Summarization in the Era of Large Language Models

9 Jul 2024 | Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, Zhenyu Chen
This paper explores the effectiveness of large language models (LLMs) in code summarization, a task aimed at generating concise natural language summaries for code snippets to enhance understanding and maintenance. The study covers multiple aspects of the LLM-based code summarization workflow, including evaluation methods, prompting techniques, model settings, programming language types, and summary categories. Key findings include:

1. **Evaluation Methods**: The GPT-4-based evaluation method shows the strongest correlation with human evaluation, making it the most suitable for assessing LLM-generated summaries.
2. **Prompting Techniques**: Advanced prompting techniques like chain-of-thought and critique do not necessarily outperform simpler techniques like zero-shot prompting. Zero-shot prompting often performs well, especially with the GPT-3.5 model (see the sketch after this list).
3. **Model Settings**: The impact of the top_p and temperature parameters on summary quality varies by LLM and programming language, but the two parameters generally have similar effects.
4. **Programming Languages**: LLMs perform suboptimally when summarizing logic programming languages compared to the other language types studied (procedural, object-oriented, scripting, functional).
5. **Summary Categories**: CodeLlama-Instruct outperforms GPT-4 in generating summaries for specific categories like "Why" and "Property", suggesting that smaller LLMs can achieve comparable or better performance in certain contexts.

The study provides valuable insights for researchers and practitioners in the field of code summarization, particularly in the era of LLMs.
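To make the prompting and model-setting findings concrete, the sketch below shows what a zero-shot code summarization call with adjustable temperature and top_p could look like, plus a GPT-4-as-judge evaluation call in the spirit of finding 1. This is a minimal illustration assuming the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY environment variable; the prompt wording, model names, rubric, and sampling values are illustrative placeholders, not the paper's exact experimental setup.

```python
# Minimal sketch, assuming the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY
# environment variable. Prompts, model names, and sampling values are
# illustrative placeholders, not the paper's exact experimental configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_code(code: str, temperature: float = 0.0, top_p: float = 1.0) -> str:
    """Zero-shot prompting: directly ask the model for a one-sentence summary,
    exposing the temperature and top_p sampling parameters (finding 3)."""
    prompt = "Summarize the following code snippet in one concise sentence:\n\n" + code
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # hypothetical choice; the study also covers GPT-4 and CodeLlama-Instruct
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    return response.choices[0].message.content.strip()


def judge_summary(code: str, summary: str) -> str:
    """GPT-4-as-judge evaluation (finding 1): ask a stronger model to rate a
    generated summary. The 1-5 rubric here is a hypothetical stand-in for the
    paper's actual evaluation criteria."""
    prompt = (
        "On a scale of 1-5, rate how accurately the summary describes the code, "
        "and briefly justify the score.\n\n"
        f"Code:\n{code}\n\nSummary:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # keep the judge as deterministic as possible
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b"
    summary = summarize_code(snippet)
    print(summary)
    print(judge_summary(snippet, summary))
```

In this framing, swapping the prompt template (for example, adding chain-of-thought instructions) or sweeping temperature and top_p values reproduces the kind of comparisons the study reports, while the evaluation prompt stays fixed.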