Evaluating Large Language Models Trained on Code

14 Jul 2021 | Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba
This paper introduces Codex, a GPT language model fine-tuned on publicly available code from GitHub, and evaluates its Python code-writing capabilities. Codex powers GitHub Copilot and is evaluated on HumanEval, a dataset of 164 programming problems with unit tests that measures functional correctness for synthesizing programs from docstrings; the problems assess language comprehension, algorithms, and simple mathematics.

With a single sample per problem, Codex-12B solves 28.8% of the HumanEval problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Codex-S, a variant further fine-tuned on correctly implemented functions, solves 37.7%. Repeated sampling is surprisingly effective: drawing 100 samples per problem yields at least one correct solution for 70.2% of the problems. With 100 samples per problem, selecting the sample with the highest mean log-probability solves 44.5% of the problems, while selecting a sample that passes the unit tests solves 77.5%.

The evaluation framework is built around the pass@k metric, the fraction of problems for which at least one of k generated samples passes the unit tests, and includes a sandbox environment for safely executing model-generated code.
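As a concrete illustration of the metric, below is a minimal sketch of the unbiased pass@k estimator the paper describes: generate n >= k samples per problem, count the number c that pass the unit tests, and estimate pass@k as 1 - C(n-c, k)/C(n, k), averaged over problems. The function name and NumPy usage here are illustrative choices rather than a quotation of the paper's code.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of pass@k for a single problem.

        n: total samples generated for the problem
        c: number of those samples that pass the unit tests
        k: sample budget being evaluated (k <= n)
        """
        if n - c < k:
            # Every size-k subset of the n samples contains a correct one.
            return 1.0
        # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

Averaging pass_at_k over all 164 problems gives the benchmark score; the combinatorial form avoids the bias of the naive estimate 1 - (1 - c/n)^k.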
The paper also compares Codex with other code models such as GPT-Neo and Tabnine, finding that Codex outperforms them on pass@k. Codex's limitations include difficulty with docstrings that describe long chains of operations and with binding operations to variables. The paper further discusses the broader impacts of deploying powerful code-generation technologies, including safety, security, and economic considerations, and concludes that while Codex is effective at generating code, its limitations and potential risks need to be addressed.
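To make the sandboxed execution mentioned in the evaluation framework concrete, here is a rough, assumption-laden sketch that runs a candidate completion together with its unit tests in a separate Python process under a timeout. The paper uses a purpose-built sandbox with much stronger isolation; a bare subprocess as shown here is an illustration only, not a real security boundary.

    import os
    import subprocess
    import sys
    import tempfile

    def passes_tests(candidate: str, test_code: str, timeout_s: float = 5.0) -> bool:
        """Run a model-generated candidate plus its unit tests in a subprocess.

        Returns True only if the combined program exits cleanly within the
        timeout. NOTE: illustrative stand-in, not a secure sandbox.
        """
        program = candidate + "\n\n" + test_code + "\n"
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout_s,
            )
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            # Non-terminating or very slow candidates count as failures.
            return False
        finally:
            os.remove(path)

The per-problem counts n (samples drawn) and c (samples for which a harness like this returns True) are exactly what the pass@k estimator sketched above consumes.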