Exploring Multi-Lingual Bias of Large Code Models in Code Generation

30 Apr 2024 | Chaozheng Wang, Zongjie Li, Cuiyun Gao, Wenxuan Wang, Ting Peng, Hailiang Huang, Yuetang Deng, Shuai Wang, Michael R. Lyu
This paper investigates multi-lingual bias in large code models (LCMs) during code generation. The study reveals that LCMs exhibit significant bias when instructions are provided in different natural languages (NLs) and when solutions are requested in different programming languages (PLs). For example, with Chinese instructions, code generation performance drops by at least 13% in terms of the Pass@1 metric, and performance also varies across programming languages, with a gap of up to 23.7% between Python and C++.

To study this issue, the paper proposes a multi-lingual evaluation benchmark, X-HumanEval-X, which includes instructions in two NLs (English and Chinese) and solutions in three PLs (Python, Java, and C++). The authors also construct a multi-lingual instruction tuning dataset, MEIC, which contains code generation instructions and their corresponding answers in two NLs and over twenty PLs.

The paper explores several methods to mitigate the bias, including translation-based prompting strategies and instruction tuning. Translation-based prompting strategies, such as one-step and multi-step translation, significantly reduce the multi-NL bias. Instruction tuning with the MEIC dataset reduces the bias further, cutting the multi-NL bias by up to 84% and the multi-PL bias by up to 40%, while also improving code generation performance: the Pass@1 metric increases by 31% to 46%.
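To make the translation-based mitigation concrete, below is a minimal sketch of what the two prompting variants might look like. This is an illustration under assumptions: `model` is a hypothetical text-in/text-out completion helper, and the prompt wording and the exact split between the "one-step" and "multi-step" variants reflect our reading of the summary, not the paper's verbatim templates.

```python
from typing import Callable

def translate_then_generate(
    model: Callable[[str], str], nl_instruction: str, target_pl: str
) -> str:
    """Multi-step variant: first translate the non-English instruction into
    English, then prompt the code model with the translated instruction."""
    english = model(
        "Translate the following programming task description into English:\n"
        + nl_instruction
    )
    return model(f"Write a {target_pl} solution for the following task:\n{english}")

def translate_and_generate(
    model: Callable[[str], str], nl_instruction: str, target_pl: str
) -> str:
    """One-step variant: ask the model to translate and solve in a single prompt."""
    return model(
        "First translate the task below into English, then write a "
        f"{target_pl} solution for it.\nTask:\n{nl_instruction}"
    )
```

Either function can wrap any LCM callable (a local inference pipeline or an API client), so the same harness could be reused across models when measuring how much translation narrows the Chinese-to-English gap.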
The findings suggest that increasing the diversity of NLs and PLs in the training data can enhance the overall performance of LCMs and reduce the multi-lingual bias. The paper concludes that multi-lingual bias is a significant issue in LCMs, and further research is needed to address this problem.
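For reference on the metric cited throughout, Pass@1 is commonly reported with the standard unbiased pass@k estimator used for HumanEval-style benchmarks; the paper may differ in sampling details, so the following is a generic sketch rather than the authors' evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples
    passes, given n generated samples of which c pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_problem: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over a benchmark, given (n_samples, n_correct) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)

# Example: two problems, 10 samples each, 3 and 0 passing -> Pass@1 = 0.15
print(mean_pass_at_k([(10, 3), (10, 0)], k=1))
```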