CodeBERT: A Pre-Trained Model for Programming and Natural Languages

November 16-20, 2020 | Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou
CodeBERT is a pre-trained model that jointly models natural language (NL) and programming language (PL). It is designed to support downstream NL-PL tasks such as natural language code search and code documentation generation. CodeBERT is trained with a hybrid objective function that combines masked language modeling (MLM) and replaced token detection (RTD), which lets it learn general-purpose representations from both bimodal data (NL-PL pairs) and unimodal data (code without paired documentation, and NL text). The training corpus covers six programming languages and is drawn from GitHub repositories: bimodal datapoints pair functions with their documentation, while unimodal data consists of code alone.

The model is based on the Transformer architecture and is fine-tuned for downstream tasks. CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation, and it outperforms previous pre-trained models on NL-PL probing tasks. It is also effective in code-to-text generation and generalizes to programming languages not seen during pre-training. Evaluations across code search, documentation generation, and probing tasks demonstrate its versatility in handling both NL and PL data.
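The hybrid objective described above combines two kinds of training signal: MLM hides tokens and asks the model to recover them, while RTD swaps some tokens for plausible alternatives and asks a discriminator to flag, per position, whether each token was replaced. The following is a minimal sketch of how those two corruption schemes label a token sequence; the function names, the toy vocabulary, and the trivial stand-in generator are illustrative assumptions (in CodeBERT the RTD generators are n-gram language models, and the real losses are computed by the Transformer, not shown here).

```python
import random

random.seed(0)

def mlm_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """MLM corruption: hide a fraction of tokens; the model must
    recover the originals recorded in `targets`."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok          # label for the MLM loss
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

def rtd_labels(tokens, generator, replace_rate=0.15):
    """RTD corruption: swap some tokens for generator proposals; the
    discriminator predicts per position whether the token is original
    (0) or replaced (1)."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < replace_rate:
            corrupted.append(generator(tok))
            labels.append(1)          # replaced token
        else:
            corrupted.append(tok)
            labels.append(0)          # original token
    return corrupted, labels

# Toy stand-in generator (hypothetical): always proposes a different
# token from a tiny vocabulary.
vocab = ["def", "return", "sum", "x", "y"]
def toy_generator(tok):
    return random.choice([v for v in vocab if v != tok])

code = ["def", "add", "(", "x", ",", "y", ")", ":",
        "return", "x", "+", "y"]
masked, targets = mlm_mask(code)
corrupted, labels = rtd_labels(code, toy_generator)
```

Because RTD only needs unlabeled sequences and a generator, it can be applied to unimodal code and NL text that lack paired documentation, which is what lets CodeBERT use both data modalities.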