13 Sep 2021 | Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou
GraphCodeBERT is a pre-trained model designed to enhance the understanding of programming code by leveraging its inherent structure, specifically data flow. Unlike traditional models that treat code as a sequence of tokens, GraphCodeBERT uses data flow, which represents the semantic relationships between variables, to capture the structure of code. This approach is less complex and more efficient than using abstract syntax trees (ASTs). The model is based on the Transformer architecture and incorporates two new structure-aware pre-training tasks: predicting code structure edges and aligning representations between source code and code structure. Experiments on four downstream tasks—code search, clone detection, code translation, and code refinement—show that GraphCodeBERT achieves state-of-the-art performance, demonstrating the effectiveness of incorporating code structure into pre-training. Further analysis reveals that the model prefers structure-level attention over token-level attention in code search tasks.
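To make the data flow idea concrete: each edge in the graph says "the value of this variable occurrence comes from that earlier occurrence." The paper builds these graphs from parser-produced ASTs across several languages; the sketch below is only an illustration of that relation, using Python's built-in ast module on a toy straight-line snippet. The function name data_flow_edges and the example snippet are my own, not from the paper's released code.

```python
import ast

def data_flow_edges(source: str):
    """Return 'comes-from' edges between variable occurrences in a
    straight-line snippet. Each occurrence is (name, line, column)."""
    last_def = {}   # variable name -> most recent defining occurrence
    edges = []

    for stmt in ast.parse(source).body:        # statements in source order
        if not isinstance(stmt, ast.Assign):
            continue
        # Right-hand side first: every variable read takes its value from
        # the most recent assignment to that name seen so far.
        for node in ast.walk(stmt.value):
            if isinstance(node, ast.Name) and node.id in last_def:
                use = (node.id, node.lineno, node.col_offset)
                edges.append((use, last_def[node.id]))
        # Then record the left-hand side as the new definition.
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                last_def[target.id] = (target.id, target.lineno, target.col_offset)
    return edges


snippet = "x = foo()\ny = x + 1\nz = x * y\n"
for use, definition in data_flow_edges(snippet):
    print(f"value of {use} comes from {definition}")
```

Running this prints three edges (x on line 2 and line 3 come from the assignment on line 1, y on line 3 comes from line 2), which is exactly the variable-level structure GraphCodeBERT feeds to the Transformer alongside the token sequence, and what its edge-prediction and code/structure-alignment pre-training tasks are defined over. The pre-trained checkpoint is published as microsoft/graphcodebert-base on the Hugging Face Hub.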