16 Aug 2021 | Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton
This paper explores the limits of large language models (LLMs) in program synthesis for general-purpose programming languages. We evaluate several LLMs (with 244M to 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both few-shot and fine-tuning regimes. MBPP contains 974 programming tasks for entry-level programmers, while MathQA-Python contains 23,914 problems testing code synthesis from complex text. Results show that synthesis performance scales log-linearly with model size. The largest models can solve 59.6% of MBPP problems using few-shot learning, and fine-tuning improves performance by about 10 percentage points. On MathQA-Python, the largest fine-tuned model achieves 83.8% accuracy.
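To make the few-shot setup concrete, the sketch below shows how an MBPP-style task can be scored: the prompt carries the task description and its assert-based test cases, and a task counts as solved if any sampled program passes every assert. The `generate_code` stub, the sample count, and the example task are illustrative assumptions, not the paper's actual harness or data.

```python
# Sketch of MBPP-style few-shot evaluation: a task counts as solved if any
# sampled program passes all of its assert-based test cases.
# `generate_code` is a placeholder for an LLM call, and the task below is
# illustrative rather than drawn from the benchmark.

def generate_code(prompt: str) -> str:
    # Stand-in for sampling from the model given the few-shot prompt.
    return "def sum_of_squares(nums):\n    return sum(n * n for n in nums)\n"

def passes_tests(program: str, tests: list[str]) -> bool:
    """Run the candidate program, then its asserts; True only if all pass."""
    env: dict = {}
    try:
        exec(program, env)          # untrusted code: sandbox this in a real harness
        for test in tests:
            exec(test, env)         # each test is an assert statement
        return True
    except Exception:
        return False

task = {
    "text": "Write a function to return the sum of squares of a list of numbers.",
    "tests": ["assert sum_of_squares([1, 2, 3]) == 14",
              "assert sum_of_squares([]) == 0"],
}

# Prompt = task description plus the asserts (few-shot examples would precede it).
prompt = task["text"] + "\n" + "\n".join(task["tests"]) + "\n"

samples = [generate_code(prompt) for _ in range(5)]   # the paper draws many more samples per task
solved = any(passes_tests(s, task["tests"]) for s in samples)
print("solved:", solved)
```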
We also study the models' ability to engage in dialog with a human to improve their code; incorporating human feedback roughly halves the error rate. Error analysis reveals that the models struggle with longer, more complex programs and with deeper semantic understanding: in semantic grounding experiments, even the largest models are generally unable to predict the output of a program on a specific input. Performance is sensitive to model size, the number of few-shot examples, and the details of the prompt. Generated solutions usually generalize to held-out test cases, though there is some evidence of overfitting to the test cases shown in the prompt. Overlap between the pre-training data and the test problems is minimal, suggesting that the results are not driven by memorization.
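The semantic grounding result can be pictured as a simple probe: give the model a program and a concrete input, ask it to predict the output, and compare against what the code actually returns. The sketch below assumes a hypothetical `predict_output` stand-in for the model query; it is not the paper's prompt format.

```python
# Sketch of an execution-prediction ("semantic grounding") check: compare a
# model's predicted output for a program/input pair against the result of
# actually running the code. `predict_output` is a placeholder for a model call.

def run_program(source: str, func_name: str, arg):
    """Execute the program and call the named function on the given input."""
    env: dict = {}
    exec(source, env)               # untrusted code: sandbox in a real harness
    return env[func_name](arg)

def predict_output(source: str, arg) -> str:
    # Stand-in for querying the LLM with the code and the concrete input.
    return "6"

program = "def double_sum(nums):\n    return 2 * sum(nums)\n"
test_input = [1, 2, 3]

actual = run_program(program, "double_sum", test_input)       # 12
predicted = predict_output(program, test_input)               # model's guess

grounded = str(actual) == predicted.strip()
print(f"actual={actual}, predicted={predicted!r}, correct={grounded}")
```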
The edited MBPP dataset shows improved performance, with 66.4% of problems solved. Human-model collaboration substantially improves synthesis performance, with four turns of dialog raising the solve rate from about 30% to over 65%. Qualitative analysis reveals that models struggle with multi-step problems, common-sibling problems, and semantic errors. Human feedback helps clarify ambiguous prompts and correct code errors. Overall, LLMs show promise in program synthesis but have limitations in understanding and generalization.
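The human-model dialog can be read as a loop: sample a program, run the tests, and on failure append a short natural-language hint to the prompt before resampling. The sketch below is a minimal illustration of that flow; the hint text, the canned model responses, and the four-turn limit are assumptions for demonstration only.

```python
# Sketch of the human-in-the-loop flow: resample after each round of feedback,
# appending the human's hint to the running prompt. Helpers are simplified
# stand-ins, not the paper's interface.

def passes_tests(program: str, tests: list[str]) -> bool:
    env: dict = {}
    try:
        exec(program, env)          # untrusted code: sandbox in a real harness
        for t in tests:
            exec(t, env)
        return True
    except Exception:
        return False

def generate_code(prompt: str) -> str:
    # Stand-in for an LLM sample conditioned on the dialog so far.
    if "hint:" in prompt:
        return ("def first_repeated(xs):\n"
                "    seen = set()\n"
                "    for x in xs:\n"
                "        if x in seen:\n"
                "            return x\n"
                "        seen.add(x)\n"
                "    return None\n")
    return "def first_repeated(xs):\n    return xs[0]\n"      # plausible wrong first attempt

tests = ["assert first_repeated([1, 2, 3, 2, 1]) == 2"]
prompt = "Write a function to find the first repeated element in a list.\n" + "\n".join(tests) + "\n"

for turn in range(4):                                          # up to four dialog turns
    candidate = generate_code(prompt)
    if passes_tests(candidate, tests):
        print(f"solved after {turn + 1} attempt(s)")
        break
    # Human feedback: a short natural-language correction appended to the prompt.
    prompt += "hint: track elements you have already seen and return the first one seen twice.\n"
```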