27 Feb 2023 | Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong
**CODEGEN: An Open Large Language Model for Code with Multi-Turn Program Synthesis**
Program synthesis aims to generate computer programs from problem specifications, often expressed in natural language or as input-output examples. Large language models have advanced the field, but the limited availability of training resources and data has hindered open access to them. To democratize the technology, the authors train and release CODEGEN, a family of large language models of up to 16.1 billion parameters, on natural language and programming language data, and open-source the training library JAXFORMER. The trained model is competitive with state-of-the-art models on zero-shot Python code generation on HumanEval. The authors further investigate a multi-step paradigm for program synthesis, in which a single program is factorized into multiple prompts that each specify a subproblem. They construct the Multi-Turn Programming Benchmark (MTPB) of 115 diverse problem sets and show that multi-turn specifications significantly improve program synthesis performance. The JAXFORMER training library and the model checkpoints are released as open-source contributions.
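As a concrete illustration of this factorization, a single specification can be expressed as a short sequence of sub-prompts, one per turn. The task and prompts below are made up for illustration and are not drawn from MTPB.

```python
# Illustrative only: one task ("summarize a CSV column") factorized into
# turn-level sub-prompts, in the spirit of the multi-turn paradigm.
# These prompts are hypothetical and not taken from the MTPB problem sets.
turns = [
    "Import pandas and load 'data.csv' into a dataframe.",
    "Keep only the rows where the 'score' column is non-null.",
    "Compute the mean of the 'score' column.",
    "Print the result rounded to two decimal places.",
]
```

At each turn the model receives the prompts and the code generated so far, and emits the code for the current subproblem.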
**Contributions:**
- Study of multi-turn program synthesis in autoregressive models under scaling laws.
- Introduction of a multi-turn program synthesis paradigm.
- Quantitative investigation of its properties with a novel multi-turn programming benchmark.
- Open-source release of model checkpoints and the custom training library JAXFORMER.
**Model Training:**
- Standard transformer-based autoregressive language models trained at four scales (350M, 2.7B, 6.1B, and 16.1B parameters) on natural language and programming language data (a loading-and-sampling sketch follows this list).
- Sequential training on three datasets: THEPILE (yielding CODEGEN-NL), then BIGQUERY (CODEGEN-MULTI), then BigPython (CODEGEN-MONO).
- Development of JAXFORMER for efficient training on Google’s TPU-v4 hardware.
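The released checkpoints can be loaded with standard tooling. The sketch below assumes the publicly distributed Hugging Face checkpoint name `Salesforce/codegen-350M-mono` and the `transformers` causal-LM API; it is a minimal usage example, not the authors' JAXFORMER training code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed public checkpoint name; swap in another size or variant as needed.
checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def hello_world():"
inputs = tokenizer(prompt, return_tensors="pt")
# Nucleus sampling with a low temperature, similar in spirit to the paper's
# evaluation settings (the exact hyperparameters here are an assumption).
outputs = model.generate(**inputs, do_sample=True, top_p=0.95, temperature=0.2,
                         max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```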
**Datasets:**
- THEPILE: An 825.18 GiB English text corpus that also contains programming language data.
- BIGQUERY: A subset of Google’s publicly available BigQuery GitHub dataset, covering six programming languages (C, C++, Go, Java, JavaScript, and Python).
- BigPython: A large amount of Python code from GitHub.
**Models:**
- Autoregressive transformers with next-token prediction language modeling as the learning objective.
- Trained at the four scales above on a mix of natural language and programming language data.
- Architecture follows a standard transformer decoder with left-to-right causal masking and rotary position embedding.
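The objective is ordinary next-token prediction: maximize the log-likelihood of each token given all tokens to its left. The sketch below shows rotary position embedding in the common "rotate-half" formulation as a NumPy toy; it illustrates the idea (a position-dependent rotation of feature pairs, so relative offsets survive the attention dot product) and is not the CODEGEN/JAXFORMER implementation.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, dim).

    Toy sketch of the rotate-half variant: feature pairs (x1, x2) are rotated
    by an angle that grows linearly with position, so the dot product of two
    rotated vectors depends on their relative offset.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-2.0 * np.arange(half) / dim)   # per-pair frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```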
**Evaluation:**
- Single-turn evaluation on HumanEval shows zero-shot Python performance competitive with state-of-the-art models such as OpenAI Codex.
- Multi-turn evaluation on MTPB demonstrates that multi-turn specifications yield higher program synthesis quality than equivalent single-turn specifications (see the sketch after this list).
- Larger models and more data improve multi-turn program synthesis capacity.
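A hedged sketch of the turn-by-turn protocol, in the spirit of MTPB rather than its actual harness: each turn's natural-language prompt is appended to the running context as a comment, the model completes the next code segment, and that segment is fed back into the context for the following turn. It reuses the `model`/`tokenizer` objects and the illustrative `turns` list from the earlier sketches.

```python
def synthesize_multi_turn(model, tokenizer, turns, max_new_tokens=128):
    """Generate code turn by turn, conditioning on all prior prompts and code."""
    context = ""
    for prompt in turns:
        context += f"# {prompt}\n"                     # specification as a comment
        inputs = tokenizer(context, return_tensors="pt")
        out = model.generate(**inputs, do_sample=True, top_p=0.95, temperature=0.2,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
        # Keep only the newly generated tokens and append them to the context.
        new_code = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
        context += new_code.rstrip() + "\n"
    return context

program = synthesize_multi_turn(model, tokenizer, turns)
print(program)
```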
**Conclusion:**
- The capacity to understand long context and generate coherent responses emerges from scaling up model and data sizes.
- Multi-turn specifications enhance user intent understanding and improve program synthesis quality.
- Open-source contributions facilitate future research and practical applications.