27 Feb 2023 | Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong
**CODEGEN: An Open Large Language Model for Code with Multi-Turn Program Synthesis**
Program synthesis aims to generate computer programs from problem specifications, often expressed in natural language or as input-output examples. Large language models have advanced the field, but the limited availability of training resources and data has hindered open access to them. To democratize the technology, the authors train and release CODEGEN, a family of large language models of up to 16.1 billion parameters, on natural language and programming language data, and open-source the training library JAXFORMER. The trained model is competitive with state-of-the-art models on zero-shot Python code generation on HumanEval. The authors further investigate a multi-step paradigm for program synthesis, in which a single program is factorized into multiple prompts that each specify a subproblem. They construct the Multi-Turn Programming Benchmark (MTPB) of 115 diverse problem sets and show that multi-turn specifications significantly improve program synthesis performance. The JAXFORMER training library and the model checkpoints are released as open-source contributions.
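As a concrete illustration of this factorization, a single specification can be expressed as a short sequence of sub-prompts, one per turn. The task and prompts below are made up for illustration and are not drawn from MTPB.

```python
# Illustrative only: one task ("summarize a CSV column") factorized into
# turn-level sub-prompts, in the spirit of the multi-turn paradigm.
# These prompts are hypothetical and not taken from the MTPB problem sets.
turns = [
    "Import pandas and load 'data.csv' into a dataframe.",
    "Keep only the rows where the 'score' column is non-null.",
    "Compute the mean of the 'score' column.",
    "Print the result rounded to two decimal places.",
]
```

At each turn the model receives the prompts and the code generated so far, and emits the code for the current subproblem.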
**Contributions:**
- Study of multi-turn program synthesis in autoregressive models under scaling laws.
- Introduction of a multi-turn program synthesis paradigm.
- Quantitative investigation of its properties with a novel multi-turn programming benchmark.
- Open-source release of model checkpoints and the custom training library JAXFORMER.
**Model Training:**
- Standard transformer-based autoregressive language models trained at four scales (350M, 2.7B, 6.1B, and 16.1B parameters) on natural language and programming language data (a loading-and-sampling sketch follows this list).
- Sequential training on three datasets: THEPILE (yielding CODEGEN-NL), then BIGQUERY (CODEGEN-MULTI), then BigPython (CODEGEN-MONO).
- Development of JAXFORMER for efficient training on Google’s TPU-v4 hardware.
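The released checkpoints can be loaded with standard tooling. The sketch below assumes the publicly distributed Hugging Face checkpoint name `Salesforce/codegen-350M-mono` and the `transformers` causal-LM API; it is a minimal usage example, not the authors' JAXFORMER training code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed public checkpoint name; swap in another size or variant as needed.
checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def hello_world():"
inputs = tokenizer(prompt, return_tensors="pt")
# Nucleus sampling with a low temperature, similar in spirit to the paper's
# evaluation settings (the exact hyperparameters here are an assumption).
outputs = model.generate(**inputs, do_sample=True, top_p=0.95, temperature=0.2,
                         max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```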
**Datasets:**
- THEPILE: An 825.18 GiB English text corpus that also contains programming language data.
- BIGQUERY: A subset of Google’s publicly available BigQuery GitHub dataset, covering six programming languages (C, C++, Go, Java, JavaScript, and Python).
- BigPython: A large amount of Python code from GitHub.
**Models:**
- Autoregressive transformers with next-token prediction language modeling as the learning objective.
- Trained at the four scales above on a mix of natural language and programming language data.
- Architecture follows a standard transformer decoder with left-to-right causal masking and rotary position embedding.
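The objective is ordinary next-token prediction: maximize the log-likelihood of each token given all tokens to its left. The sketch below shows rotary position embedding in the common "rotate-half" formulation as a NumPy toy; it illustrates the idea (a position-dependent rotation of feature pairs, so relative offsets survive the attention dot product) and is not the CODEGEN/JAXFORMER implementation.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, dim).

    Toy sketch of the rotate-half variant: feature pairs (x1, x2) are rotated
    by an angle that grows linearly with position, so the dot product of two
    rotated vectors depends on their relative offset.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-2.0 * np.arange(half) / dim)   # per-pair frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```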
**Evaluation:**
- Single-turn evaluation on HumanEval shows zero-shot Python performance competitive with state-of-the-art models such as OpenAI Codex.
- Multi-turn evaluation on MTPB demonstrates that multi-turn specifications yield higher program synthesis quality than equivalent single-turn specifications (see the sketch after this list).
- Larger models and more data improve multi-turn program synthesis capacity.
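A hedged sketch of the turn-by-turn protocol, in the spirit of MTPB rather than its actual harness: each turn's natural-language prompt is appended to the running context as a comment, the model completes the next code segment, and that segment is fed back into the context for the following turn. It reuses the `model`/`tokenizer` objects and the illustrative `turns` list from the earlier sketches.

```python
def synthesize_multi_turn(model, tokenizer, turns, max_new_tokens=128):
    """Generate code turn by turn, conditioning on all prior prompts and code."""
    context = ""
    for prompt in turns:
        context += f"# {prompt}\n"                     # specification as a comment
        inputs = tokenizer(context, return_tensors="pt")
        out = model.generate(**inputs, do_sample=True, top_p=0.95, temperature=0.2,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
        # Keep only the newly generated tokens and append them to the context.
        new_code = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
        context += new_code.rstrip() + "\n"
    return context

program = synthesize_multi_turn(model, tokenizer, turns)
print(program)
```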
**Conclusion:**
- The capacity to understand long context and generate coherent responses emerges from scaling up model and data sizes.
- Multi-turn specifications enhance user intent understanding and improve program synthesis quality.
- Open-source contributions facilitate future research and practical applications.