A Survey on Large Language Models for Code Generation

September 2024

JUYONG JIANG*, The Hong Kong University of Science and Technology (Guangzhou), China
FAN WANG*, The Hong Kong University of Science and Technology (Guangzhou), China
JIASI SHEN†, The Hong Kong University of Science and Technology, China
SUNGJU KIM†, NAVER Cloud, South Korea
SUNGHUN KIM†, The Hong Kong University of Science and Technology (Guangzhou), China
This survey provides a comprehensive and up-to-date review of Large Language Models (LLMs) for code generation, addressing a gap in the existing literature. It introduces a taxonomy to categorize recent advancements, covering data curation, performance evaluation, and real-world applications. The survey traces the evolution of LLMs for code generation, including models such as ChatGPT, GPT-4, LLaMA, and StarCoder, and discusses critical challenges and promising opportunities. Key aspects include the Transformer architecture and its core components: multi-head self-attention, position-wise feed-forward networks, residual connections, and normalization. The survey also examines data curation and processing, data synthesis techniques, model architectures, pre-training tasks, and evaluation methods, emphasizing the importance of synthetic data and the role of instruction tuning in enhancing code generation capabilities. It concludes with a discussion of practical applications and future directions, aiming to serve as a valuable reference for researchers and practitioners in the field of code generation with LLMs.
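The abstract names the Transformer components shared by these code-generation models. As a minimal illustrative sketch (assuming PyTorch; the class name, hyperparameters, and pre-norm layout are assumptions for illustration, not taken from the survey), one decoder-style block combining multi-head self-attention, a position-wise feed-forward network, residual connections, and layer normalization could look like this:

```python
# Minimal sketch of a single Transformer decoder block (assumption: PyTorch).
# Hyperparameters are illustrative defaults, not values from the survey.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        # Multi-head self-attention over the token sequence.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        # Position-wise feed-forward network, applied to each position independently.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # Layer normalization, multi-head self-attention, then a residual connection.
        h = self.norm1(x)
        attn_out, _ = self.self_attn(h, h, h, attn_mask=causal_mask,
                                     need_weights=False)
        x = x + self.dropout(attn_out)
        # Layer normalization, position-wise feed-forward, then a residual connection.
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x


if __name__ == "__main__":
    seq_len, d_model = 16, 512
    block = TransformerBlock(d_model=d_model)
    tokens = torch.randn(2, seq_len, d_model)  # (batch, sequence, features)
    # Causal mask so each position attends only to earlier positions,
    # as in decoder-only code-generation LLMs.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    out = block(tokens, mask)
    print(out.shape)  # torch.Size([2, 16, 512])
```

Stacking many such blocks, with a token embedding layer below and a language-modeling head above, yields the decoder-only architecture used by most of the code LLMs the survey covers.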