16 Aug 2021 | Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton
This paper explores the capabilities of large language models for program synthesis in general-purpose programming languages. The authors evaluate models ranging from 244 million to 137 billion parameters on two new benchmarks: MBPP (Mostly Basic Programming Problems) and MathQA-Python. MBPP contains 974 Python programming tasks designed to be solvable by entry-level programmers, while MathQA-Python is a Python version of the MathQA benchmark with 23,914 problems.

The study finds that synthesis performance scales log-linearly with model size, and that fine-tuning improves it further. The largest model synthesizes solutions to 59.6% of MBPP problems using few-shot learning, and fine-tuning on a small subset of the dataset adds roughly 10 percentage points. On MathQA-Python, the largest model achieves 83.8% accuracy.

The paper also examines the models' ability to engage in dialog with humans, finding that natural language feedback from a person halves the error rate. In addition, the study probes the semantic grounding of the models, showing that they generally cannot predict the output of a program for a given input. The paper concludes with an analysis of the sensitivity of performance to various factors and a discussion of potential criticisms of large language models for program synthesis.
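To make the few-shot evaluation protocol concrete, the sketch below shows one way a prompt could be assembled from a handful of solved examples and how a sampled completion could be checked against a task's assert-style test cases (MBPP problems ship with assert-based tests, and a problem counts as solved if any sampled program passes them all). This is an illustrative reconstruction, not the authors' code: `generate_completion`, the prompt format, and the sample count of 80 are hypothetical placeholders.

```python
# Illustrative sketch (not the authors' code): build a few-shot MBPP-style prompt
# and check a sampled completion against the task's assert-based tests.
# `generate_completion` is a hypothetical stand-in for sampling from the model.

def build_prompt(solved_examples, task_description):
    """Concatenate a few solved (description, solution) pairs with the new task."""
    parts = [f"# {desc}\n{solution}\n" for desc, solution in solved_examples]
    parts.append(f"# {task_description}\n")
    return "\n".join(parts)

def passes_tests(candidate_code, test_asserts):
    """Return True if the candidate program satisfies every assert statement."""
    namespace = {}
    try:
        exec(candidate_code, namespace)      # define the candidate function(s)
        for test in test_asserts:
            exec(test, namespace)            # e.g. "assert reverse('ab') == 'ba'"
        return True
    except Exception:
        return False

# Example usage with a hypothetical sampler:
# prompt = build_prompt(solved_examples, "Write a function to reverse a string.")
# samples = [generate_completion(prompt) for _ in range(80)]
# solved = any(passes_tests(code, ["assert reverse('ab') == 'ba'"]) for code in samples)
```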
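The semantic-grounding finding (that the models usually cannot predict what a program outputs) can be probed with a similar execution-based check: show the model a program and an input, ask for the result, and compare its answer against what the code actually returns. The sketch below is an assumed setup, not the paper's exact experiment; `query_model` is a hypothetical helper.

```python
# Illustrative probe (a sketch under assumed details, not the paper's exact setup):
# compare the model's predicted output with the result of actually running the code.

def ground_truth_output(program_src, func_name, arg):
    """Execute the program and return the true result for one input."""
    namespace = {}
    exec(program_src, namespace)
    return namespace[func_name](arg)

def check_execution_prediction(query_model, program_src, func_name, arg):
    """Return True if the model's predicted output matches real execution."""
    prompt = f"{program_src}\n# What does {func_name}({arg!r}) return?\n"
    predicted = query_model(prompt).strip()
    actual = repr(ground_truth_output(program_src, func_name, arg))
    return predicted == actual
```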