IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators

15 Apr 2024 | Indraneil Paul, Goran Glavas, and Iryna Gurevych
This paper introduces IRCoder, a new approach to improving the multilingual capabilities of code generation models (Code-LMs) by leveraging compiler intermediate representations (IR). The authors propose using IR as a shared representation across programming languages to facilitate cross-lingual transfer and to enhance code understanding and generation. They create a parallel dataset called SLTrans, consisting of nearly 4 million pairs of self-contained source code files and their corresponding IR. Using this dataset, they train various Code-LMs on IR-grounded data, yielding significant improvements across multiple code generation tasks and metrics, including prompt robustness, multilingual code completion, code understanding, and instruction following. The study highlights the limitations of current Code-LMs, which are typically pre-trained on source code alone and struggle with multilingual code generation due to the skewed distribution of programming languages in code corpora.
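This summary does not detail the paper's data pipeline, but a source-IR pair of the kind collected in SLTrans can be sketched by compiling a self-contained file to LLVM IR. The snippet below is a minimal illustration rather than the authors' pipeline: the helper name, the file paths, and the use of clang with -S -emit-llvm at -O0 are all assumptions made for this example.

```python
import subprocess
from pathlib import Path

def source_ir_pair(c_file: Path) -> dict:
    """Compile a self-contained C file to textual LLVM IR and return the
    (source, IR) pair as one training record.

    Hypothetical helper: SLTrans covers many languages and compiler front
    ends; this sketch only illustrates the C/clang case.
    """
    ir_file = c_file.with_suffix(".ll")
    # -S -emit-llvm produces human-readable LLVM IR; the optimization level
    # here is an assumption, not the authors' documented setting.
    subprocess.run(
        ["clang", "-S", "-emit-llvm", "-O0", str(c_file), "-o", str(ir_file)],
        check=True,
    )
    return {"source": c_file.read_text(), "ir": ir_file.read_text()}

if __name__ == "__main__":
    pair = source_ir_pair(Path("example.c"))  # hypothetical input file
    print(pair["ir"][:300])
```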
The authors argue that IR offers a more effective shared representation for cross-lingual transfer, since it is agnostic to both the source programming language and the target execution platform. Grounding Code-LMs in IR leads to substantial performance gains, particularly in multilingual code understanding and instruction following. The authors also investigate whether IR grounding improves robustness to prompt perturbations, which is crucial for the reliability of code generation models, and find that it significantly enhances robustness to variations such as code formatting changes, syntactic variation, and function name mangling. IR grounding likewise improves multilingual code understanding and completion, with the largest gains observed for low-resource languages, and it strengthens instruction following, with the biggest improvements seen for the strongest Code-LMs. Overall, the paper demonstrates that leveraging IR as a shared representation can substantially improve the multilingual capabilities of Code-LMs, making them more robust and effective across a wide range of programming languages and tasks, and suggests that IR grounding is a promising direction for multilingual code generation.
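As an illustration of the kind of prompt perturbation involved in the robustness evaluation, the toy sketch below renames the function in a completion prompt. The helper name and the regex-based renaming are assumptions made for this example; the benchmarks referenced in the paper define their own, broader perturbation suites.

```python
import random
import re
import string

def mangle_function_name(prompt: str, seed: int = 0) -> str:
    """Toy version of one perturbation family (function name mangling):
    rename the first 'def' in a Python completion prompt to a random
    identifier.
    """
    rng = random.Random(seed)
    match = re.search(r"def\s+(\w+)\s*\(", prompt)
    if match is None:
        return prompt
    original = match.group(1)
    mangled = "".join(rng.choices(string.ascii_lowercase, k=8))
    # Rename every occurrence so the docstring and prompt stay consistent.
    return re.sub(rf"\b{re.escape(original)}\b", mangled, prompt)

print(mangle_function_name('def add_two(a, b):\n    """Return a + b."""\n'))
```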