CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

2 Sep 2021 | Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi
CodeT5 is a unified pre-trained encoder-decoder model designed for code understanding and generation. It leverages developer-assigned identifiers to better capture code semantics and introduces a novel identifier-aware pre-training task to distinguish and recover identifiers. Additionally, it uses bimodal dual generation to improve the alignment between natural language and programming language. CodeT5 outperforms prior methods on code understanding and generation tasks, including defect detection, clone detection, and code summarization, across various programming languages. The model supports multi-task learning and is pre-trained on large-scale code data, including CodeSearchNet and additional datasets from GitHub. CodeT5's identifier-aware pre-training and bimodal dual generation tasks significantly enhance its ability to understand and generate code. The model is released for public use and has shown state-of-the-art performance on the CodeXGLUE benchmark. The paper also discusses ethical considerations, including dataset bias, computational costs, and security implications of using CodeT5 in real-world applications.
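To make the identifier-aware objective concrete, the released model can be queried for masked-span infilling: an identifier is replaced with a sentinel token and the decoder is asked to recover it from the surrounding code. A minimal sketch, assuming the Hugging Face transformers library and the publicly released Salesforce/codet5-base checkpoint (neither is named in this summary):

    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    # Load the released CodeT5 checkpoint (assumed Hugging Face model id).
    tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

    # Mask an identifier with a T5-style sentinel token; the model is asked
    # to recover it from the surrounding code context.
    code = "def greet(user): print(f'hello <extra_id_0>!')"
    input_ids = tokenizer(code, return_tensors="pt").input_ids

    generated_ids = model.generate(input_ids, max_length=10)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

A plausible completion here would be an identifier-like span such as {user}, illustrating how the pre-training signal ties identifiers to their usage context.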
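The bimodal dual generation objective trains the same encoder-decoder in both directions, code-to-text and text-to-code; the code-to-text direction is what the code summarization task fine-tunes. A sketch of summarizing a function, assuming a summarization-tuned checkpoint under the model id Salesforce/codet5-base-multi-sum (an assumption; this summary does not name specific released checkpoints):

    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    # Summarization-tuned CodeT5 (assumed model id; PL -> NL direction).
    tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")

    # A small function to summarize; the model generates a natural
    # language description of what the code does.
    code = (
        "def add_user(db, name, email):\n"
        "    user = {'name': name, 'email': email}\n"
        "    db.insert(user)\n"
        "    return user"
    )
    input_ids = tokenizer(code, return_tensors="pt").input_ids
    summary_ids = model.generate(input_ids, max_length=24)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))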