24 Jun 2024 | Tao Sun, Linzheng Chai, Jian Yang, Yuwei Yin, Hongcheng Guo, Jiaheng Liu, Bing Wang, Liqun Yang, Zhoujun Li
This paper introduces UNICODER, a code generation model that uses universal code (UniCode) as an intermediate representation to improve code generation and translation. UniCode describes algorithm steps using a mix of programming-language conventions, such as assignment operators, conditional operators, and loops. To train the model, the authors collect an instruction dataset, UNICODER-INSTRUCT, containing natural-language questions, code solutions, and the corresponding universal code.
The model is trained with multi-task learning objectives: question-answer generation, question-universal-code generation, universal-code-answer translation, and Universal-code-of-Thought (UoT). UNICODER significantly outperforms previous prompting methods on code generation and translation, as shown on benchmarks such as HumanEval, MBPP, and MultiPL-E across multiple programming languages. An ablation study verifies the efficacy of the method and examines how different universal code formats affect performance. The study highlights the value of structured intermediate representations in code generation, with UNICODER achieving state-of-the-art results on both code generation and translation tasks.
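To make the idea of a language-agnostic intermediate representation concrete, here is a hedged illustration (not taken from the paper, which defines its own exact UniCode format): a hypothetical UniCode sketch for summing the even numbers in a list, followed by its Python translation.

```python
# Hypothetical UniCode intermediate representation (illustrative only;
# the paper defines the actual UniCode format, not reproduced here):
#
#   function sum_even(nums):
#       total = 0
#       for x in nums:
#           if x % 2 == 0:
#               total = total + x
#       return total

def sum_even(nums):
    """Python translation of the hypothetical UniCode sketch above."""
    total = 0
    for x in nums:
        if x % 2 == 0:
            total += x
    return total

print(sum_even([1, 2, 3, 4]))  # prints 6
```

Because the intermediate steps use generic constructs (assignment, conditionals, loops), the same UniCode could just as easily be translated into Java or C++, which is what makes it useful for multilingual generation and translation.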
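The four training objectives can be read as four views of one (question, universal code, code) triple. A minimal sketch of how a UNICODER-INSTRUCT triple might expand into multi-task examples follows; the dict fields, prompt wording, and helper name are assumptions for illustration, not the paper's actual data format.

```python
# Hedged sketch: one (question, universal code, code) triple yields the
# four multi-task training examples named in the summary. Field names
# and formatting are illustrative assumptions, not the paper's schema.

def make_multitask_examples(question: str, unicode_ir: str, code: str):
    return [
        # question -> answer (direct code generation)
        {"input": question, "output": code},
        # question -> universal code
        {"input": question, "output": unicode_ir},
        # universal code -> answer (translation into a concrete language)
        {"input": unicode_ir, "output": code},
        # Universal-code-of-Thought: draft UniCode first, then the code
        {"input": question, "output": unicode_ir + "\n" + code},
    ]

examples = make_multitask_examples(
    "Sum the even numbers in a list.",
    "total = 0; for x in nums: if x % 2 == 0: total = total + x",
    "def sum_even(nums): return sum(x for x in nums if x % 2 == 0)",
)
print(len(examples))  # prints 4
```

The UoT view is the analogue of chain-of-thought prompting: the model is trained to emit the structured intermediate representation before the final code, rather than jumping straight to the answer.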