CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

16 Mar 2021 | Shuai Lu*, Daya Guo*, Shuo Ren*, Junjie Huang*, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu
CodeXGLUE is a benchmark dataset designed to advance machine learning research in program understanding and generation. It comprises 14 datasets covering 10 diverse programming language tasks, together with a platform for model evaluation and comparison. The tasks are code clone detection, defect detection, cloze test, code completion, code translation, code search, code repair, text-to-code generation, code summarization, and documentation translation. Datasets are chosen or created on the criterion that they support the development and evaluation of data-driven machine learning methods. CodeXGLUE includes both previously proposed datasets and newly introduced ones.

The benchmark provides three baseline systems to help perform the tasks: CodeBERT, a BERT-style model for code understanding; CodeGPT, a GPT-style model for completion and generation; and an Encoder-Decoder framework for sequence-to-sequence generation. CodeXGLUE aims to accelerate research on programming language tasks by providing a comprehensive set of tasks and datasets for model evaluation and comparison.
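The split of work among the three baselines can be sketched as a simple lookup. The assignment of each task to a baseline below is an assumption inferred from the stated role of each model (understanding, completion and generation, sequence-to-sequence); the summary itself does not list it task by task, and the names `TASK_TO_BASELINE` and `tasks_by_baseline` are hypothetical.

```python
from collections import defaultdict

# Assumed task-to-baseline assignment, inferred from each baseline's stated role.
# It is an illustration, not a table quoted from the benchmark.
TASK_TO_BASELINE = {
    "code clone detection": "CodeBERT",
    "defect detection": "CodeBERT",
    "cloze test": "CodeBERT",
    "code search": "CodeBERT",
    "code completion": "CodeGPT",
    "text-to-code generation": "CodeGPT",
    "code translation": "Encoder-Decoder",
    "code repair": "Encoder-Decoder",
    "code summarization": "Encoder-Decoder",
    "documentation translation": "Encoder-Decoder",
}


def tasks_by_baseline(mapping):
    """Group task names under the baseline assumed to handle them."""
    grouped = defaultdict(list)
    for task, baseline in mapping.items():
        grouped[baseline].append(task)
    return dict(grouped)


if __name__ == "__main__":
    for baseline, tasks in tasks_by_baseline(TASK_TO_BASELINE).items():
        print(f"{baseline}: {', '.join(tasks)}")
```

Grouping by baseline makes it easy to see which tasks share a model family when planning a comparison across the benchmark.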