13 Sep 2021 | Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou
GraphCodeBERT is a pre-trained model for programming languages that takes the inherent structure of code into account. Unlike previous models that treat code as a plain sequence of tokens, GraphCodeBERT uses data flow, a semantic-level structure that encodes the "where-the-value-comes-from" relation between variables. Compared with syntax-level structure such as the AST, data flow is less complex and does not introduce an unnecessarily deep hierarchy, which makes the model more efficient. GraphCodeBERT is built on the Transformer architecture and uses a graph-guided masked attention function to incorporate the code structure. Two new structure-aware pre-training tasks are introduced: predicting code structure edges and aligning representations between source code and data flow. The model is pre-trained on the CodeSearchNet dataset and evaluated on four downstream tasks: code search, clone detection, code translation, and code refinement. GraphCodeBERT achieves state-of-the-art performance on all four tasks. Analysis indicates that both the code structure and the new pre-training tasks contribute significantly to this improvement, and that the model prefers structure-level attention over token-level attention in the code search task.
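To make the graph-guided masked attention idea concrete, here is a minimal sketch (not the authors' implementation) of how such an attention mask could be built. Assumptions: the input packs the comment/code tokens first, followed by the data-flow variable nodes; `flow_edges` is a hypothetical list of (source, destination) node pairs from the data-flow graph, and `node_to_code` is a hypothetical mapping from each variable node to the code-token positions it was extracted from.

```python
import torch

def graph_guided_attention_mask(n_tokens, n_nodes, flow_edges, node_to_code):
    """Build an additive attention mask: 0 where attention is allowed, -inf elsewhere."""
    size = n_tokens + n_nodes
    allowed = torch.zeros(size, size, dtype=torch.bool)

    # Comment/code tokens attend to each other as in a standard Transformer.
    allowed[:n_tokens, :n_tokens] = True

    # A variable node attends to itself, and to another node only if a
    # data-flow ("where-the-value-comes-from") edge connects them.
    for i in range(n_nodes):
        allowed[n_tokens + i, n_tokens + i] = True
    for src, dst in flow_edges:
        allowed[n_tokens + dst, n_tokens + src] = True

    # A variable node and the code token(s) it was identified from
    # are allowed to attend to each other.
    for node, positions in node_to_code.items():
        for pos in positions:
            allowed[n_tokens + node, pos] = True
            allowed[pos, n_tokens + node] = True

    mask = torch.zeros(size, size)
    mask.masked_fill_(~allowed, float("-inf"))
    return mask

# Toy example: 5 code tokens for "x = y + 1", two variable nodes (x, y),
# and one data-flow edge y -> x (the value of x comes from y).
mask = graph_guided_attention_mask(
    n_tokens=5, n_nodes=2,
    flow_edges=[(1, 0)],
    node_to_code={0: [0], 1: [2]},
)
print(mask.shape)  # torch.Size([7, 7])
```

This additive mask would then be added to the attention scores before the softmax, so that disallowed query-key pairs receive zero attention weight; the paper describes an equivalent masking scheme inside the Transformer's self-attention.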