2 Sep 2021 | Yue Wang¹, Weishi Wang¹,², Shafiq Joty¹,², and Steven C.H. Hoi¹
CodeT5 is a unified pre-trained encoder-decoder Transformer model designed to enhance code understanding and generation tasks. It leverages developer-assigned identifiers to better capture code semantics and employs a novel identifier-aware pre-training task that teaches the model to distinguish identifier tokens and recover them when they are masked. Additionally, CodeT5 introduces a bimodal dual generation task that uses code comments to improve the alignment between natural language (NL) and programming language (PL). Comprehensive experiments demonstrate that CodeT5 outperforms prior methods on various code-related tasks, including code defect detection, clone detection, and generation across different directions (PL-NL, NL-PL, and PL-PL). Detailed analysis further validates the model's ability to better capture semantic information from code. Code and pre-trained models are available at <https://github.com/salesforce/CodeT5>.
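To make the released checkpoints concrete, below is a minimal sketch (not taken from the paper) of loading a CodeT5 model through HuggingFace Transformers and filling in a masked span, loosely mirroring the span-masking style of the pre-training objective; the checkpoint name `Salesforce/codet5-base` and the specific code snippet are assumptions based on the linked repository.

```python
# Minimal sketch: querying a released CodeT5 checkpoint via HuggingFace Transformers.
# The checkpoint name "Salesforce/codet5-base" is assumed from the linked repository.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Mask a span with a T5 sentinel token and ask the model to recover it,
# here covering an identifier inside a small Python function.
code = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(code, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```

The same encoder-decoder interface applies to the downstream generation tasks listed above (e.g., code summarization or NL-to-code generation) by feeding the source sequence to the encoder and decoding the target sequence.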