The paper "IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators" by Indraneil Paul, Goran Glavas, and Iryna Gurevych explores the use of compiler intermediate representations (IR) to enhance the multilingual capabilities of code generation language models (Code-LMs). The authors address the limitations of current Code-LMs, particularly in handling cross-lingual transfer and low-resource languages, by leveraging IR, which is shared across programming languages and provides a common representation for code understanding and generation.
Key contributions of the work include:
1. **Dataset Creation**: Development of SLTrans, a parallel dataset consisting of nearly 4 million pairs of self-contained source code files and their corresponding IR (a sketch of how such a pair can be produced follows this list).
2. **Model Training**: Continued causal language modeling training of various base Code-LMs (ranging from 1.1B to 7.3B parameters) on SLTrans, forcing the models to learn the IR language and to align IR constructs with constructs of the source programming languages (a sketch of how such pairs could be serialized for training also follows the list).
3. **Model Evaluation**: Evaluating the resulting models, dubbed IRCoder, on a wide range of code generation tasks and metrics, including prompt robustness, multilingual code completion, code understanding, and instruction following.
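
To make the dataset-construction step concrete, below is a minimal sketch of producing one source/IR pair. This is not the authors' SLTrans pipeline; it assumes the IR in question is textual LLVM IR emitted by clang, and the file name and optimization flag are illustrative:

```python
import subprocess
from pathlib import Path


def source_to_llvm_ir(src_path: str, opt_level: str = "-O1") -> str:
    """Compile a self-contained C/C++ source file to textual LLVM IR.

    Illustrative sketch only: assumes clang is on PATH and the file
    compiles as a standalone translation unit.
    """
    src = Path(src_path)
    ir_path = src.with_suffix(".ll")
    # -S -emit-llvm asks clang to emit human-readable LLVM IR (.ll).
    subprocess.run(
        ["clang", "-S", "-emit-llvm", opt_level, str(src), "-o", str(ir_path)],
        check=True,
    )
    return ir_path.read_text()


if __name__ == "__main__":
    # Hypothetical input file; any self-contained source unit works.
    ir_text = source_to_llvm_ir("example.c")
    print(ir_text[:500])
```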
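Likewise, a rough sketch of how a source file and its IR could be concatenated into a single sequence for continued causal language modeling. The separator tokens and formatting below are placeholders, not the scheme used in the paper; the point is only that source and IR appear in the same context window so the model can learn alignments between them:

```python
from dataclasses import dataclass

# Placeholder separators; the actual formatting used for IRCoder may differ.
SRC_TAG = "<source>"
IR_TAG = "<ir>"
EOS = "</s>"


@dataclass
class PairExample:
    source_code: str
    llvm_ir: str


def to_training_sequence(example: PairExample) -> str:
    """Serialize a source/IR pair into one causal-LM training sequence."""
    return f"{SRC_TAG}\n{example.source_code}\n{IR_TAG}\n{example.llvm_ir}\n{EOS}"


if __name__ == "__main__":
    pair = PairExample(
        source_code="int add(int a, int b) { return a + b; }",
        llvm_ir="define i32 @add(i32 %a, i32 %b) { ... }",
    )
    print(to_training_sequence(pair))
```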
The study demonstrates that grounding Code-LMs in IR significantly improves their performance across multiple tasks and programming languages. The authors also investigate the effects of IR grounding on robustness to prompt perturbations, multilingual code understanding, and instruction following, showing substantial gains in all areas. The results suggest that IR grounding facilitates cross-lingual transfer and enhances the robustness and generalization of Code-LMs.