November 16 - 20, 2020 | Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou
CodeBERT is a bimodal pre-trained model designed to bridge the gap between natural language (NL) and programming language (PL). It leverages both NL-PL pairs and unimodal data to learn general-purpose representations that support downstream tasks such as natural language code search and code documentation generation. The model is trained with a hybrid objective that combines masked language modeling (MLM) and replaced token detection (RTD). CodeBERT achieves state-of-the-art performance on these tasks and outperforms previous pre-trained models in a zero-shot setting, demonstrating its effectiveness in NL-PL understanding and generation. The paper also introduces a dataset for NL-PL probing to investigate the knowledge learned by CodeBERT, showing that it consistently outperforms RoBERTa, a purely natural language-based pre-trained model.
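As a rough illustration of how such a bimodal model can be applied to natural language code search, the sketch below embeds an NL query and a code snippet jointly. It assumes the publicly released `microsoft/codebert-base` checkpoint on Hugging Face and the `transformers` library; the query, snippet, and use of the first-token embedding as a retrieval feature are illustrative choices, not the paper's exact pipeline.

```python
# Minimal sketch: embedding an NL-PL pair with a pre-trained CodeBERT checkpoint.
# Assumes the microsoft/codebert-base model hosted on Hugging Face.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

nl_query = "return the maximum value in a list"          # illustrative query
code_snippet = "def max_value(xs): return max(xs)"       # illustrative snippet

# Passing the NL and PL segments as a text pair inserts separator tokens,
# mirroring the bimodal input format used during pre-training.
inputs = tokenizer(nl_query, code_snippet, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Use the first-token embedding as a joint NL-PL representation,
# e.g. as a scoring feature for natural language code search.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```

In practice, a code-search system would embed the query and each candidate snippet and rank candidates by a similarity or learned relevance score; the snippet above only shows how to obtain the shared representation.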